Abstract. This paper presents a new algorithm to perform regression estimation, in both the inductive and transductive settings. The estimator is defined as a linear combination of functions in a given dictionary. The coefficients of the combination are computed sequentially using projections on some simple sets. These sets are defined as confidence regions provided by a deviation (PAC) inequality on an estimator in one-dimensional models. We prove that every projection performed by the algorithm actually improves the performance of the estimator. We give all the estimators and results first in the inductive case, where the algorithm requires the knowledge of the distribution of the design, and then in the transductive case, which seems a more natural application for this algorithm as we do not need particular information on the distribution of the design in this case. We finally show a connection with oracle inequalities, which enables us to prove that the estimator reaches minimax rates of convergence in Sobolev and Besov spaces.



1. The setting of the problem

We give here notations and introduce the inductive and transductive settings.

1.1. Transductive and inductive settings

Let $(\mathcal{X}, \mathcal{B})$ be a measure space and let $\mathcal{B}_{\mathbb{R}}$ denote the Borel $\sigma$-algebra on $\mathbb{R}$.
1.1.1. The inductive setting

In the inductive setting, we assume that $P$ is a distribution on pairs $Z = (X, Y)$ taking values in $(\mathcal{X} \times \mathbb{R}, \mathcal{B} \otimes \mathcal{B}_{\mathbb{R}})$, that $P$ is such that:

$$P|Y| < \infty,$$

and that we observe $N$ independent pairs $Z_i = (X_i, Y_i)$ for $i \in \{1, \dots, N\}$. Our objective is then to estimate the regression function on the basis of the observations.

Definition 1.1 (The regression function). We denote:

$$f : \mathcal{X} \to \mathbb{R}, \qquad x \mapsto P(Y \mid X = x).$$

1.1.2. The transductive setting

In the transductive case, we will assume that, for a given integer $k > 0$, $P_{(k+1)N}$ is some exchangeable probability measure on the space $((\mathcal{X} \times \mathbb{R})^{(k+1)N}, (\mathcal{B} \otimes \mathcal{B}_{\mathbb{R}})^{\otimes (k+1)N})$. We will write $(X_i, Y_i)_{i=1,\dots,(k+1)N} = (Z_i)_{i=1,\dots,(k+1)N}$ for a random vector distributed according to $P_{(k+1)N}$.

Definition 1.2 (Exchangeable probability distribution). For any integer $j$, let $\mathfrak{S}_j$ denote the set of all permutations of $\{1, \dots, j\}$. We say that $P_{(k+1)N}$ is exchangeable if, for any $\sigma \in \mathfrak{S}_{(k+1)N}$, the vector $(X_{\sigma(i)}, Y_{\sigma(i)})_{i=1,\dots,(k+1)N}$ has the same distribution under $P_{(k+1)N}$ as $(X_i, Y_i)_{i=1,\dots,(k+1)N}$.

We assume that we observe $(X_i, Y_i)_{i=1,\dots,N}$ and $(X_i)_{i=N+1,\dots,(k+1)N}$. The observation $(X_i, Y_i)_{i=1,\dots,N}$ is usually called the training sample, while the other part of the vector, $(X_i, Y_i)_{i=N+1,\dots,(k+1)N}$, is called the test sample. In this case, we only focus on the estimation of the values $(Y_i)_{i=N+1,\dots,(k+1)N}$. This is why Vapnik (22) called this kind of inference "transductive inference" when he introduced it.

Note that in this setting, the pairs $(X_i, Y_i)$ are not necessarily independent, but are identically distributed. We will let $P$ denote their marginal distribution, and we can here again define the regression function $f$.

Actually, since most statistical problems are usually formulated in the inductive setting, the reader may wonder about the pertinence of the study of the transductive setting. Let us think of the following examples: in quality control, or in a sample survey, we try to infer information about a whole population from observations on a small sample. In these cases, transductive inference seems actually better suited than inductive inference, with $N$ the size of the sample and $(k+1)N$ the size of the population. One can see that the use of inductive results in this context is only motivated by the large values of $k$ (the inductive case is the limit case of the transductive case where $k \to +\infty$). In the problems connected with regression estimation or classification, we can imagine a case where a lot of images are collected, for example on the internet. Since labeling every picture according to whether or not it represents a given object would take too long, one can think of labeling only one image out of $k+1$, and then using a transductive algorithm to label the other data automatically. We hope that these examples can convince the reader that the use of the transductive setting is not unrealistic. However, the reader who is not convinced should remember that transductive inference was first introduced by Vapnik mainly as a tool to study the inductive case: there are techniques to get rid of the second part of the sample by taking an expectation with respect to it and obtain results valid in the inductive setting (see for example a result by Panchenko used in this paper, (17)).

1.2. The model

In both settings, we are going to use the same model to estimate the regression function: $\Theta$. The only thing we assume about $\Theta$ is that it is a vector space of functions.

Note in particular that we do not assume that $f$ belongs to $\Theta$.
1.3. Overview of the results

In both settings, we give a PAC inequality on the risk of estimators in one-dimensional models of the form:

$$\{\alpha\theta(\cdot),\ \alpha \in \mathbb{R}\}$$

for a given $\theta \in \Theta$.

This result motivates an algorithm that performs iterative feature selection in order to perform regression estimation. We will then remark that the selection procedure gives the guarantee that every selected feature actually improves the current estimator.

In the inductive setting (Section 2), it means that we estimate $f(\cdot)$ by a function $\hat\theta(\cdot) \in \Theta$, but the selection procedure can only be performed if the statistician knows the marginal distribution $P_{(X)}$ of $X$ under $P$.

In the transductive case (Section 3), the estimation of $Y_{N+1}, \dots, Y_{(k+1)N}$ can be performed by the procedure without any prior knowledge about the marginal distribution of $X$ under $P$. We first focus on the case $k = 1$, and then on the general case $k \in \mathbb{N}^*$.

Finally, in Section 4, we use the main result of the paper (the fact that every selected feature improves the performance of the estimator) as an oracle inequality, to compute the rate of convergence of the estimator in Sobolev and Besov spaces.

The last section (Section 5) is dedicated to the proofs.

The literature on iterative methods for regression estimation is vast. Let us mention one of the first algorithms, AdaLine, by Widrow and Hoff (23), or more recent versions like boosting, see (18) and the references within. The technique developed here has some similarity with the so-called greedy algorithms, see (3) (and the references within) for a survey and some recent results. However, note that in these techniques, the iterative update of the estimator is motivated by algorithmic issues, and is not motivated statistically. In particular, AdaLine has no guarantee against overfitting if the number of variables $m$ is large (say $m = N$). For greedy algorithms, one has to specify a particular penalization if one wants to get a guarantee against overfitting. The same remark can be made about boosting algorithms. Here, the algorithm is motivated by a statistical result, and as a consequence has theoretical guarantees against overlearning. It nevertheless remains computationally feasible; some pseudo-code is given in the paper.

Closer to our technique are the methods of aggregation of statistical estimators, see (16) and (21), and more recently the mirror descent algorithm studied in (13) or (14). In these papers, oracle inequalities are given ensuring that the estimator performs as well as the best (linear or convex) aggregation of functions in a given family, up to an optimal term. Note that these inequalities are given in expectation, whereas here almost all results are given as deviation bounds (or PAC bounds, that is, bounds that are true with high probability, from which we derive a bound in expectation in Section 4). Similar bounds were given for the PAC-Bayesian model aggregation developed by Catoni (7), Yang (24) and Audibert (2). In some way, the algorithm proposed in this paper can be seen as a practical way to implement these results.

Note that nearly all the methods in the papers mentioned previously were designed especially for the inductive setting. Very few algorithms were created specifically for the transductive regression problem. The algorithm described in this paper seems better suited to the transductive setting (remember that the procedure can be performed in the inductive setting only if the statistician knows the marginal distribution of $X$ under $P$, while there is no such assumption in the transductive context).

Let us however start with a presentation of our method in the inductive context. 

2. Main theorem in the inductive case, and application to estimation

2.1. Additional definition

Definition 2.1. We put:

$$R(\theta) = P[(Y - \theta(X))^2],$$

$$r(\theta) = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \theta(X_i))^2,$$

and in this setting, our objective is $\bar\theta$ given by:

$$\bar\theta \in \arg\min_{\theta \in \Theta} R(\theta).$$
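As a small illustration of Definition 2.1, the empirical risk $r(\theta)$ can be evaluated directly from the sample, whereas $R(\theta)$ cannot, since it depends on the unknown $P$. The sketch below is only an illustration; the helper name `empirical_risk` and the toy data are assumptions, not part of the original text.

```python
def empirical_risk(theta, xs, ys):
    """r(theta) = (1/N) * sum_i (Y_i - theta(X_i))^2 for a callable theta."""
    n = len(xs)
    return sum((y - theta(x)) ** 2 for x, y in zip(xs, ys)) / n


# Toy sample (hypothetical): Y = 2 X exactly.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 2.0, 4.0]

risk_exact_fit = empirical_risk(lambda x: 2.0 * x, xs, ys)  # theta(x) = 2x
risk_zero = empirical_risk(lambda x: 0.0, xs, ys)           # theta = 0
```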

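2.2. Main theorem

We suppose that we have an integer $m \in \mathbb{N}$ and that we are given a finite family of functions:

$$\Theta_0 = \{\theta_1, \dots, \theta_m\} \subset \Theta.$$

Definition 2.2. Let us put, for any $k \in \{1, \dots, m\}$:

$$\bar\alpha_k = \arg\min_{\alpha \in \mathbb{R}} R(\alpha\theta_k) = \frac{P[\theta_k(X)Y]}{P[\theta_k(X)^2]},$$

$$\hat\alpha_k = \arg\min_{\alpha \in \mathbb{R}} r(\alpha\theta_k) = \frac{(1/N)\sum_{i=1}^{N}\theta_k(X_i)Y_i}{(1/N)\sum_{i=1}^{N}\theta_k(X_i)^2},$$

$$\mathcal{C}_k = \frac{(1/N)\sum_{i=1}^{N}\theta_k(X_i)^2}{P[\theta_k(X)^2]}.$$

Theorem 2.1. Let us assume that $P$ is such that $|f|$ is bounded by a constant $B$, and such that:

$$P\{[Y - f(X)]^2\} \le \sigma^2 < +\infty.$$

We have, for any $\varepsilon > 0$, with $P^{\otimes N}$-probability at least $1 - \varepsilon$, for any $k \in \{1, \dots, m\}$:

$$R(\mathcal{C}_k\hat\alpha_k\theta_k) - R(\bar\alpha_k\theta_k) \le \frac{4[1 + \log(2m/\varepsilon)]}{N}\left[\frac{(1/N)\sum_{i=1}^{N}\theta_k(X_i)^2Y_i^2}{P[\theta_k(X)^2]} + B^2 + \sigma^2\right]. \qquad (2.1)$$

The proof of this theorem is given in Section 5.6.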
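The quantities entering Theorem 2.1 are straightforward to compute for one feature once the second moment $P[\theta_k(X)^2]$ is known. The following sketch assumes the reading of Definition 2.2 in which $\mathcal{C}_k$ is the ratio of the empirical to the true second moment, so that $\mathcal{C}_k\hat\alpha_k$ is the empirical cross moment divided by $P[\theta_k(X)^2]$; the function name, argument names and toy values are hypothetical.

```python
import math


def centre_and_radius(feature_values, ys, second_moment, m, eps, bound_f, sigma2):
    """Return (alpha_hat, C_k * alpha_hat, beta) for one feature theta_k.

    feature_values : the values theta_k(X_i), i = 1..N
    second_moment  : P[theta_k(X)^2], assumed known (inductive setting)
    bound_f, sigma2: the constants B and sigma^2 of Theorem 2.1
    """
    n = len(ys)
    emp_cross = sum(t * y for t, y in zip(feature_values, ys)) / n
    emp_sq = sum(t * t for t in feature_values) / n
    emp_sq_y2 = sum((t * y) ** 2 for t, y in zip(feature_values, ys)) / n

    alpha_hat = emp_cross / emp_sq        # least-squares coefficient
    centre = emp_cross / second_moment    # C_k * alpha_hat_k
    # Right-hand side of the bound, as reconstructed in (2.1).
    beta = (4.0 * (1.0 + math.log(2.0 * m / eps)) / n
            * (emp_sq_y2 / second_moment + bound_f ** 2 + sigma2))
    return alpha_hat, centre, beta


alpha_hat, centre, beta = centre_and_radius(
    [1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0],
    second_moment=2.0, m=1, eps=0.5, bound_f=1.0, sigma2=1.0)
```

Note that $\mathcal{C}_k\hat\alpha_k$ does not involve the empirical second moment at all, which is why the procedure needs the design distribution in the inductive case.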
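2.3. Application to regression estimation

2.3.1. Interpretation of Theorem 2.1 in terms of confidence intervals

Definition 2.3. Let us put, for any $(\theta, \theta') \in \Theta^2$:

$$d_P(\theta, \theta') = \sqrt{P_{(X)}[(\theta(X) - \theta'(X))^2]}.$$

Let also $\|\cdot\|_P$ denote the norm associated with this distance, $\|\theta\|_P = d_P(\theta, 0)$, and $\langle\cdot,\cdot\rangle_P$ the associated scalar product:

$$\langle\theta, \theta'\rangle_P = P[\theta(X)\theta'(X)].$$

Because $\bar\alpha_k = \arg\min_{\alpha \in \mathbb{R}} R(\alpha\theta_k)$ we have:

$$R(\mathcal{C}_k\hat\alpha_k\theta_k) - R(\bar\alpha_k\theta_k) = d_P^2(\mathcal{C}_k\hat\alpha_k\theta_k, \bar\alpha_k\theta_k).$$

So the theorem can be written:

$$P^{\otimes N}\left[\forall k \in \{1, \dots, m\},\ d_P^2(\mathcal{C}_k\hat\alpha_k\theta_k, \bar\alpha_k\theta_k) \le \beta(\varepsilon, k)\right] \ge 1 - \varepsilon,$$

where $\beta(\varepsilon, k)$ is the right-hand side of the inequality of Theorem 2.1 (inequality (2.1)).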
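Now, note that $\bar\alpha_k\theta_k$ is the orthogonal projection of:

$$\bar\theta = \arg\min_{\theta \in \Theta} R(\theta)$$

onto the space $\{\alpha\theta_k,\ \alpha \in \mathbb{R}\}$, with respect to the inner product $\langle\cdot,\cdot\rangle_P$:

$$\bar\alpha_k = \arg\min_{\alpha \in \mathbb{R}} d_P(\alpha\theta_k, \bar\theta).$$

Definition 2.4. We define, for any $k$ and $\varepsilon$:

$$\mathcal{CR}(k, \varepsilon) = \left\{\theta \in \Theta : \left|\left\langle \theta - \mathcal{C}_k\hat\alpha_k\theta_k, \frac{\theta_k}{\|\theta_k\|_P}\right\rangle_P\right| \le \sqrt{\beta(\varepsilon, k)}\right\}.$$

Then the theorem is equivalent to the following corollary.

Corollary 2.2. We have:

$$P^{\otimes N}\left[\forall k \in \{1, \dots, m\},\ \bar\theta \in \mathcal{CR}(k, \varepsilon)\right] \ge 1 - \varepsilon.$$

In other words, $\bigcap_{k \in \{1,\dots,m\}} \mathcal{CR}(k, \varepsilon)$ is a confidence region at level $\varepsilon$ for $\bar\theta$.

Definition 2.5. We write $\Pi_P^{k,\varepsilon}$ the orthogonal projection into $\mathcal{CR}(k, \varepsilon)$ with respect to the distance $d_P$.

Note that this orthogonal projection is not a projection on a linear subspace of $\Theta$, and so it is not a linear mapping.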
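The projection of Definition 2.5 can be written in closed form. The computation below is a sketch derived from Definitions 2.3 to 2.5, not a formula quoted from the source; the symbols $v_k$, $\gamma_k(\theta)$ and $s_k$ are introduced here only for illustration.

```latex
% Notation assumed for this sketch:
%   v_k = \|\theta_k\|_P^2,
%   \gamma_k(\theta) = \mathcal{C}_k\hat\alpha_k - \langle\theta,\theta_k\rangle_P / v_k,
%   s_k = \sqrt{\beta(\varepsilon,k) / v_k}.
\begin{align*}
\Big\langle \theta - \mathcal{C}_k\hat\alpha_k\theta_k,
            \frac{\theta_k}{\|\theta_k\|_P} \Big\rangle_P
  &= \frac{\langle\theta,\theta_k\rangle_P - \mathcal{C}_k\hat\alpha_k v_k}{\sqrt{v_k}}
   = -\sqrt{v_k}\,\gamma_k(\theta), \\
\theta \in \mathcal{CR}(k,\varepsilon)
  &\iff |\gamma_k(\theta)| \le s_k, \\
\Pi_P^{k,\varepsilon}\theta
  &= \theta + \operatorname{sgn}\big(\gamma_k(\theta)\big)\,
     \big(|\gamma_k(\theta)| - s_k\big)_+\,\theta_k, \\
d_P^2\big(\theta, \Pi_P^{k,\varepsilon}\theta\big)
  &= v_k\,\big(|\gamma_k(\theta)| - s_k\big)_+^2 .
\end{align*}
```

In words, the region is a slab orthogonal to $\theta_k$, so projecting onto it only moves the current function along the direction $\theta_k$, and leaves it unchanged when it already lies in the slab.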

2.3.2. The algorithm

The previous corollaries of Theorem 2.1 motivate the following iterative algorithm:

• choose $\theta^{(0)} \in \Theta$, for example $\theta^{(0)} = 0$;

• at step $n \in \mathbb{N}^*$, we have $\theta^{(0)}, \dots, \theta^{(n-1)}$. Choose $k(n) \in \{1, \dots, m\}$ (this choice can of course be data dependent), and take:

$$\theta^{(n)} = \Pi_P^{k(n),\varepsilon}\theta^{(n-1)};$$

• we can use the following stopping rule: $\|\theta^{(n-1)} - \theta^{(n)}\|_P^2 \le \kappa$, for a small threshold $\kappa > 0$.

Definition 2.6. Let $n_0$ denote the stopping step, and:

$$\hat\theta(\cdot) = \theta^{(n_0)}(\cdot)$$

the corresponding function.

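2.3.3. Results and comments on the algorithm

Theorem 2.3. We have:

$$P^{\otimes N}\left[\forall n \in \{1, \dots, n_0\},\ R(\theta^{(n)}) \le R(\theta^{(n-1)}) - d_P^2(\theta^{(n)}, \theta^{(n-1)})\right] \ge 1 - \varepsilon.$$

Proof. This is just a consequence of the preceding corollary. Let us assume that:

$$\forall k \in \{1, \dots, m\},\quad R(\mathcal{C}_k\hat\alpha_k\theta_k) - R(\bar\alpha_k\theta_k) \le \beta(\varepsilon, k).$$

Let us choose $n \in \{1, \dots, n_0\}$. We have, for a $k \in \{1, \dots, m\}$:

$$\theta^{(n)} = \Pi_P^{k,\varepsilon}\theta^{(n-1)},$$

where $\Pi_P^{k,\varepsilon}$ is the projection into a convex set that contains $\bar\theta$. This implies that:

$$\langle\theta^{(n)} - \theta^{(n-1)}, \bar\theta - \theta^{(n)}\rangle_P \ge 0,$$

or:

$$d_P^2(\theta^{(n-1)}, \bar\theta) \ge d_P^2(\theta^{(n)}, \bar\theta) + d_P^2(\theta^{(n-1)}, \theta^{(n)}),$$

which can be written:

$$R[\theta^{(n-1)}] - R(\bar\theta) \ge R[\theta^{(n)}] - R(\bar\theta) + d_P^2(\theta^{(n-1)}, \theta^{(n)}). \qquad \square$$

Actually, the main point in the motivation of the algorithm is that, with probability at least $1 - \varepsilon$, whatever the current value $\theta^{(n)} \in \Theta$, whatever the feature $k \in \{1, \dots, m\}$ (even chosen on the basis of the data), $\Pi_P^{k,\varepsilon}\theta^{(n)}$ is a better estimator than $\theta^{(n)}$.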

So we can choose $k(n)$ as we want in the algorithm. For example, Theorem 2.3 motivates the choice:

$$k(n) = \arg\max_{k} d_P(\theta^{(n-1)}, \mathcal{CR}(k, \varepsilon)).$$

This version of the algorithm is detailed in Fig. 1. If looking for the exact maximum of $d_P(\theta^{(n-1)}, \mathcal{CR}(k, \varepsilon))$ with respect to $k$ is too computationally intensive, we can use any heuristic to choose $k(n)$, or even skip this maximization and take:

$$k(1) = 1, \dots, k(m) = m,\ k(m+1) = 1, \dots, k(2m) = m, \dots$$

Example 2.1. Let us assume that $\mathcal{X} = [0, 1]$ and let us put $\Theta = L_2(P_{(X)})$. Let $(\theta_k)_{k \in \mathbb{N}^*}$ be an orthonormal basis of $\Theta$. The choice of $m$ should not be a problem: since the algorithm itself avoids overlearning, we can take a large value of $m$, like $m = N$. In this setting, the algorithm is a procedure for (soft) thresholding of coefficients. In the particular case of a wavelet basis, see (10) or (15) for a presentation of wavelet coefficient thresholding. Here, the threshold is not necessarily the same for every coefficient. We can remark that the sequential projection on every $k$ is sufficient here:

$$k(1) = 1, \dots, k(m) = m,$$

after that $\theta^{(m+n)} = \theta^{(m)}$ for every $n \in \mathbb{N}$ (because all the directions of the different projections are orthogonal).

Actually, it is possible to prove that the estimator is able to adapt itself to the regularity of the function to achieve a good mean rate of convergence. More precisely, if we assume that the true regression function has an (unknown) regularity $\beta$, then it is possible to choose $m$ and $\varepsilon$ in such a way that the rate of convergence is:

$$N^{-2\beta/(2\beta+1)}\log N.$$

We prove this point in Section 4.
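In the orthonormal case of Example 2.1, one pass of projections amounts to soft thresholding each coefficient with its own threshold. A minimal sketch of that operation follows; the function name and the numerical values are hypothetical, and the thresholds are taken as given rather than computed from Theorem 2.1.

```python
def soft_threshold(coeffs, thresholds):
    """Shrink each coefficient towards 0 by its own threshold (soft thresholding)."""
    out = []
    for c, t in zip(coeffs, thresholds):
        shrink = max(abs(c) - t, 0.0)          # (|c| - t)_+
        out.append(shrink if c >= 0 else -shrink)
    return out


# Three coefficients with coefficient-specific thresholds.
shrunk = soft_threshold([3.0, -0.5, 1.5], [1.0, 1.0, 0.5])
```

The second coefficient is set to zero because its magnitude is below its threshold, which is the feature-selection effect of the procedure.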
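We have $\varepsilon > 0$, $\kappa > 0$, $N$ observations $(X_1, Y_1), \dots, (X_N, Y_N)$, $m$ features $\theta_1(\cdot), \dots, \theta_m(\cdot)$ and $c = (c_1, \dots, c_m) = (0, \dots, 0) \in \mathbb{R}^m$. Compute at first every $\hat\alpha_k$ and $\beta(\varepsilon, k)$ for $k \in \{1, \dots, m\}$. Set $n \leftarrow 0$.

Repeat:

• set $n \leftarrow n + 1$;

• set best_improvement $\leftarrow 0$;

• for $k \in \{1, \dots, m\}$, compute:

$$v_k = P[\theta_k(X)^2],$$

$$\gamma_k = \mathcal{C}_k\hat\alpha_k - \frac{1}{v_k}\sum_{j=1}^{m} c_j P[\theta_j(X)\theta_k(X)],$$

$$\delta_k = v_k\left(|\gamma_k| - \beta(\varepsilon, k)\right)_+^2,$$

and if $\delta_k >$ best_improvement, set:

best_improvement $\leftarrow \delta_k$,
$k(n) \leftarrow k$;

• if best_improvement $> 0$ set:

$$c_{k(n)} \leftarrow c_{k(n)} + \operatorname{sgn}(\gamma_{k(n)})\left(|\gamma_{k(n)}| - \beta(\varepsilon, k(n))\right)_+;$$

until best_improvement $< \kappa$ (where $\operatorname{sgn}(x) = -1$ if $x < 0$ and $1$ otherwise).

Note that at each step $n$, $\theta^{(n)}$ is given by:

$$\theta^{(n)}(\cdot) = \sum_{k=1}^{m} c_k\theta_k(\cdot),$$

so after the last step we can return the estimator:

$$\hat\theta(\cdot) = \sum_{k=1}^{m} c_k\theta_k(\cdot).$$

Fig. 1. Detailed version of the feature selection algorithm.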
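The loop of Fig. 1 can be sketched in a few lines once everything is expressed in coefficient space. The code below is an illustration under stated assumptions, not the author's implementation: the Gram matrix $P[\theta_j(X)\theta_k(X)]$ and the centres $\mathcal{C}_k\hat\alpha_k$ are taken as inputs, and the half-widths are assumed to be already on the coefficient scale (for instance $\sqrt{\beta(\varepsilon,k)/v_k}$, which is what Definition 2.4 gives when rewritten in terms of $\gamma_k$). All function and argument names are hypothetical.

```python
def iterative_feature_selection(gram, centres, half_widths,
                                kappa=1e-12, max_steps=10000):
    """Greedy projections onto the confidence slabs, in coefficient space.

    gram[j][k]     : P[theta_j(X) theta_k(X)] (assumed known)
    centres[k]     : C_k * alpha_hat_k
    half_widths[k] : slab half-width on the coefficient scale (assumption)
    """
    m = len(centres)
    c = [0.0] * m
    for _ in range(max_steps):
        best_improvement, best_k, best_move = 0.0, None, 0.0
        for k in range(m):
            v_k = gram[k][k]
            inner = sum(c[j] * gram[j][k] for j in range(m))
            gamma_k = centres[k] - inner / v_k
            excess = max(abs(gamma_k) - half_widths[k], 0.0)
            delta_k = v_k * excess ** 2     # squared distance to the slab
            if delta_k > best_improvement:
                best_improvement = delta_k
                best_k = k
                best_move = excess if gamma_k >= 0 else -excess
        if best_k is not None:
            c[best_k] += best_move          # project onto the farthest slab
        if best_improvement < kappa:
            break
    return c


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
c_orthonormal = iterative_feature_selection(
    identity, [3.0, -0.5, 1.5], [1.0, 1.0, 0.5])

# A correlated dictionary with zero half-widths: the iterates approach
# the solution of the normal equations.
c_correlated = iterative_feature_selection(
    [[1.0, 0.5], [0.5, 1.0]], [1.0, 1.0], [0.0, 0.0])
```

With an orthonormal dictionary the loop reproduces soft thresholding; the `max_steps` guard is an addition of this sketch for non-orthogonal dictionaries, where the stopping rule may take many steps to trigger.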



Remark 2.1. Note that in its general form, the algorithm does not require any assumption about the dictionary of functions $\Theta_0 = \{\theta_1, \dots, \theta_m\}$. This family can be non-orthogonal, it can even be redundant (the dimension of the vector space generated by $\Theta_0$ can be smaller than $m$).

Remark 2.2. It is possible to generalize Theorem 2.1 to models of dimension larger than 1. The algorithm itself can take advantage of these generalizations. This point is developed in (1), where some experiments on the performance of our algorithm can also be found.
2.4. Additional notations for some refinements of Theorem 2.1

Note that an improvement of the inequality in Theorem 2.1 (inequality (2.1)) would allow us to apply the same method, but would lead to smaller confidence regions and so to better performance. The end of this section is dedicated to improvements (and generalizations) of this bound.

Hypothesis. Until the end of Section 2, we assume that $\Theta_0$ and $P$ are such that:

$$\forall \theta \in \Theta_0,\quad P\exp[\theta(X)Y] < +\infty.$$

Definition 2.7. For any random variable $T$ we put:

$$V(T) = P[(T - PT)^2],$$

$$M^3(T) = P[(T - PT)^3],$$

and we define, for any $\gamma > 0$, $P_{\gamma T}$ by:

$$\frac{dP_{\gamma T}}{dP} = \frac{\exp(\gamma T)}{P[\exp(\gamma T)]}.$$

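For any random variables $T$, $T'$ and any $\gamma > 0$ we put:

$$V_{\gamma T}(T') = P_{\gamma T}[(T' - P_{\gamma T}T')^2],$$

$$M^3_{\gamma T}(T') = P_{\gamma T}[(T' - P_{\gamma T}T')^3].$$

Section 2.5 gives an improvement of Theorem 2.1 while Section 2.6 extends it to the case of a data-dependent family $\Theta_0$.

2.5. Refinements of Theorem 2.1

Theorem 2.4. Let us put:

$$W_\theta = \theta(X)Y - P(\theta(X)Y).$$

Then we have, for any $\varepsilon > 0$, with $P^{\otimes N}$-probability at least $1 - \varepsilon$, for any $k \in \{1, \dots, m\}$:

$$R(\mathcal{C}_k\hat\alpha_k\theta_k) - R(\bar\alpha_k\theta_k) \le \frac{2\log(2m/\varepsilon)}{N}\,\frac{V(W_{\theta_k})}{P[\theta_k(X)^2]} + \frac{\log^3(2m/\varepsilon)}{N^{3/2}}\,C_N(P, m, \varepsilon, \theta_k),$$

where $C_N(P, m, \varepsilon, \theta_k)$ is an explicit remainder term [its defining formula is illegible in the extracted source].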
For the proof, see Section 5.1.

Actually, the method we proposed requires being able to compute explicitly the upper bound in this theorem. Remark that, with $\varepsilon$ and $m$ fixed, $C_N(P, m, \varepsilon, \theta_k)$ converges as $N \to +\infty$ to a finite limit [the limit is illegible in the extracted source; its denominator reads $9V(W_{\theta_k})^{5/2}P[\theta_k(X)^2]$], and so we can choose to consider only the first-order term. Another possible choice is to make stronger assumptions on $P$ and $\Theta_0$ that allow us to upper bound $C_N(P, m, \varepsilon, \theta_k)$ explicitly. For example, if we assume that $Y$ is bounded by $C_Y$ and that $\theta_k(\cdot)$ is bounded by $C'_k$, then $W_{\theta_k}$ is bounded by $C_k = 2C_YC'_k$ and $C_N(P, m, \varepsilon, \theta_k)$ admits an explicit upper bound in terms of $C_k$, $V(W_{\theta_k})$ and $P[\theta_k(X)^2]$ [display illegible in the extracted source].

The main problem is actually that the first-order term contains the quantity $V(W_{\theta_k})$ that is not observable, and we would like to be able to replace this quantity by its natural estimator:

$$\hat V_k = \frac{1}{N}\sum_{i=1}^{N}\left[Y_i\theta_k(X_i) - \frac{1}{N}\sum_{j=1}^{N}Y_j\theta_k(X_j)\right]^2.$$
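The natural estimator of $V(W_{\theta_k})$ is simply the empirical variance of the products $Y_i\theta_k(X_i)$. A minimal sketch, with a hypothetical function name and toy data:

```python
def variance_estimate(feature_values, ys):
    """Empirical variance of Y_i * theta_k(X_i), normalised by N."""
    n = len(ys)
    w = [t * y for t, y in zip(feature_values, ys)]
    mean_w = sum(w) / n
    return sum((wi - mean_w) ** 2 for wi in w) / n


v_hat = variance_estimate([1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0])
```

Unlike $V(W_{\theta_k})$ itself, this quantity depends only on the observed sample.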

The following theorem justifies this method. 
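Theorem 2.5. If we assume that there is a constant $c$ such that:

$$\forall k \in \{1, \dots, m\},\quad P[\exp(cW_{\theta_k}^2)] < \infty,$$

we have, for any $\varepsilon > 0$, with $P^{\otimes N}$-probability at least $1 - \varepsilon$, for any $k \in \{1, \dots, m\}$:

$$R(\mathcal{C}_k\hat\alpha_k\theta_k) - R(\bar\alpha_k\theta_k) \le \frac{2\log(4m/\varepsilon)}{N}\,\frac{\hat V_k}{P[\theta_k(X)^2]} + \text{(second-order term)},$$

where we have:

$$\hat V_k = \frac{1}{N}\sum_{i=1}^{N}\left[Y_i\theta_k(X_i) - \frac{1}{N}\sum_{j=1}^{N}Y_j\theta_k(X_j)\right]^2,$$

and where the second-order term is expressed through a quantity $C'_N(P, m, \varepsilon, \theta_k)$, itself defined from $C_N$, $V(W_{\theta_k})$ and $V(W_{\theta_k}^2)$ [its explicit expression is illegible in the extracted source].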
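The proof is given in Section 5.1.

2.6. An extension to the case of Support Vector Machines

Thanks to a method due to Seeger (20), it is possible to extend this method to the case where the set $\Theta_0$ is data dependent in the following way:

$$\Theta_0(Z_1, \dots, Z_N, N) = \bigcup_{i=1}^{N}\Theta_0(Z_i, N),$$

where for any $z \in \mathcal{X} \times \mathbb{R}$, the cardinality of the set $\Theta_0(z, N)$ depends only on $N$, not on $z$. We will write $m'(N)$ this cardinality. So we have:

$$|\Theta_0(Z_1, \dots, Z_N, N)| \le N|\Theta_0(Z_i, N)| = Nm'(N).$$

We put:

$$\Theta_0(Z_i, N) = \{\theta_{i,1}, \dots, \theta_{i,m'(N)}\}.$$

In this case, we need some adaptations of our previous notations.

Definition 2.8. We put, for $i \in \{1, \dots, N\}$:

$$r_i(\theta) = \frac{1}{N-1}\sum_{j \in \{1,\dots,N\},\ j \ne i}(Y_j - \theta(X_j))^2.$$

For any $(i, k) \in \{1, \dots, N\} \times \{1, \dots, m'(N)\}$, we write:

$$\hat\alpha_{i,k} = \arg\min_{\alpha \in \mathbb{R}} r_i(\alpha\theta_{i,k}) = \frac{\sum_{j \ne i}\theta_{i,k}(X_j)Y_j}{\sum_{j \ne i}\theta_{i,k}(X_j)^2},$$

$$\bar\alpha_{i,k} = \arg\min_{\alpha \in \mathbb{R}} R(\alpha\theta_{i,k}) = \frac{P[\theta_{i,k}(X)Y]}{P[\theta_{i,k}(X)^2]},$$

$$\mathcal{C}_{i,k} = \frac{\frac{1}{N-1}\sum_{j \ne i}\theta_{i,k}(X_j)^2}{P[\theta_{i,k}(X)^2]}.$$

Theorem 2.6. We have, for any $\varepsilon > 0$, with $P^{\otimes N}$-probability at least $1 - \varepsilon$, for any $k \in \{1, \dots, m'(N)\}$ and $i \in \{1, \dots, N\}$:

$$R(\mathcal{C}_{i,k}\hat\alpha_{i,k}\theta_{i,k}) - R(\bar\alpha_{i,k}\theta_{i,k}) \le \frac{2\log(2Nm'(N)/\varepsilon)}{N-1}\,\frac{V(W_{\theta_{i,k}})}{P[\theta_{i,k}(X)^2]} + \frac{\log^3(2Nm'(N)/\varepsilon)}{(N-1)^{3/2}}\,C_{N-1}(P, Nm'(N), \varepsilon, \theta_{i,k}).$$

The proof is given in Section 5.1.

We can use this theorem to build an estimator using the algorithm described in the previous subsection, with obvious changes in the notations.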

Example 2.2. Let us consider the case where $\mathcal{H}$ is a Hilbert space with scalar product $\langle\cdot,\cdot\rangle$, and:

$$\Theta = \{\theta(\cdot) = \langle h, \Psi(\cdot)\rangle,\ h \in \mathcal{H}\}$$

where $\Psi$ is a mapping $\mathcal{X} \to \mathcal{H}$. Let us put $\Theta_0[(x, y), N] = \{\langle\Psi(x), \Psi(\cdot)\rangle\}$. In this case we have $m'(N) = 1$ and the estimator is of the form:

$$\hat\theta(\cdot) = \sum_{i=1}^{N}\alpha_{i,1}\langle\Psi(X_i), \Psi(\cdot)\rangle.$$

Let us define,

$$K(x, x') = \langle\Psi(x), \Psi(x')\rangle,$$

the function $K$ is called the kernel, and:

$$I = \{1 \le i \le N : \alpha_{i,1} \ne 0\},$$

that is called the set of support vectors. Then the estimate has the form of a support vector machine (SVM):

$$\hat\theta(\cdot) = \sum_{i \in I}\alpha_{i,1}K(X_i, \cdot).$$

SVM were first introduced by Boser, Guyon and Vapnik (5) in the context of classification, and then generalized by Vapnik (22) to the context of regression estimation. For a general introduction to SVM, see also (6) and (9).

Example 2.3. A widely used kernel is the Gaussian kernel:

$$K_\gamma(x, x') = \exp\left(-\gamma\, d^2(x, x')\right),$$

where $d(\cdot,\cdot)$ is some distance over the space $\mathcal{X}$ and $\gamma > 0$. But in practice, the choice of the parameter $\gamma$ is difficult. A way to solve this problem is to introduce multiscale SVM. We simply take $\Theta$ as the set of all bounded functions $\mathcal{X} \to \mathbb{R}$. Now, let us put:

$$\Theta_0[(x, y), N] = \{K_2(x, \cdot), K_{2^2}(x, \cdot), \dots, K_{2^{m'(N)}}(x, \cdot)\}.$$

In this case, we obtain an estimator of the form:

$$\hat\theta(\cdot) = \sum_{k=1}^{m'(N)}\sum_{i \in I_k}\alpha_{i,k}K_{2^k}(X_i, \cdot),$$

that could be called multiscale SVM. Remark that we can use this technique to define SVM using simultaneously different kernels (not necessarily the same kernel at different scales).
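To make the multiscale dictionary of Example 2.3 concrete, the following sketch builds the features $K_{2^k}(X_i, \cdot)$ for a few sample points. It assumes the Gaussian kernel has the form $\exp(-\gamma d^2(x, x'))$ with $d$ the Euclidean distance, which is one reading of the garbled display and should be treated as an assumption; all function names are hypothetical.

```python
import math


def gaussian_kernel(x, x_prime, gamma):
    """exp(-gamma * d(x, x')^2) with d the Euclidean distance (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)


def multiscale_dictionary(points, n_scales):
    """Features x -> K_{2^k}(X_i, x) for every point X_i and k = 1..n_scales."""
    features = []
    for k in range(1, n_scales + 1):
        for p in points:
            # Default arguments freeze p and the scale for each closure.
            features.append(lambda x, p=p, g=2.0 ** k: gaussian_kernel(p, x, g))
    return features


features = multiscale_dictionary([(0.0,), (1.0,)], n_scales=3)
value_at_centre = features[0]((0.0,))   # K_2(X_1, X_1)
value_off_centre = features[1]((0.0,))  # K_2(X_2, X_1) with |X_2 - X_1| = 1
```

The dictionary has $N m'(N)$ elements, here $2 \times 3 = 6$, and the selection algorithm then decides which point and which scale receive a non-zero coefficient.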

3. The transductive case

3.1. Notations

Let us recall that we assume that $k \in \mathbb{N}^*$, that $P_{(k+1)N}$ is some exchangeable probability measure (exchangeability is defined in Definition 1.2) on the space $((\mathcal{X} \times \mathbb{R})^{(k+1)N}, (\mathcal{B} \otimes \mathcal{B}_{\mathbb{R}})^{\otimes (k+1)N})$. Let $(X_i, Y_i)_{i=1,\dots,(k+1)N} = (Z_i)_{i=1,\dots,(k+1)N}$ denote a random vector distributed according to $P_{(k+1)N}$.

Let us remark that under this condition, the marginal distribution of every $Z_i$ is the same; we will call $P$ this distribution. In the particular case where the observations are i.i.d., we will have $P_{(k+1)N} = P^{\otimes (k+1)N}$, but what follows still holds for general exchangeable distributions $P_{(k+1)N}$.

We assume that we observe $(X_i, Y_i)_{i=1,\dots,N}$ and $(X_i)_{i=N+1,\dots,(k+1)N}$. In this case, we only focus on the estimation of the values $(Y_i)_{i=N+1,\dots,(k+1)N}$.



P. Alquier



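Definition 3.1. We put, for any $\theta \in \Theta$:

$$r_1(\theta) = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \theta(X_i))^2,$$

$$r_2(\theta) = \frac{1}{kN}\sum_{i=N+1}^{(k+1)N}(Y_i - \theta(X_i))^2.$$

Our objective is:

$$\bar\theta_2 = \arg\min_{\theta \in \Theta} r_2(\theta),$$

if the minimum of $r_2$ is not unique then we take for $\bar\theta_2$ any element of $\Theta$ reaching the minimum value of $r_2$.

Let $\Theta_0$ be a finite family of vectors belonging to $\Theta$, so that $|\Theta_0| = m$. Actually, $\Theta_0$ is allowed to be data-dependent:

$$\Theta_0 = \Theta_0(X_1, \dots, X_{(k+1)N}),$$

but we assume that the function $(x_1, \dots, x_{(k+1)N}) \mapsto \Theta_0(x_1, \dots, x_{(k+1)N})$ is exchangeable with respect to its $(k+1)N$ arguments, and is such that $m = m(N)$ depends only on $N$, not on $(X_1, \dots, X_{(k+1)N})$.

The problem of the indexation of the elements of $\Theta_0$ is not straightforward and we must be very careful about it. Let $<_\Theta$ be a complete order on $\Theta$, and write:

$$\Theta_0 = \{\theta_1, \dots, \theta_m\},$$

where

$$\theta_1 <_\Theta \dots <_\Theta \theta_m.$$

Remark that, in this case, every $\theta_h$ is an exchangeable function of $(X_1, \dots, X_{(k+1)N})$.

Definition 3.2. Now, let us write, for any $h \in \{1, \dots, m\}$:

$$\alpha_1^h = \arg\min_{\alpha \in \mathbb{R}} r_1(\alpha\theta_h) = \frac{\sum_{i=1}^{N}\theta_h(X_i)Y_i}{\sum_{i=1}^{N}\theta_h(X_i)^2},$$

$$\alpha_2^h = \arg\min_{\alpha \in \mathbb{R}} r_2(\alpha\theta_h) = \frac{\sum_{i=N+1}^{(k+1)N}\theta_h(X_i)Y_i}{\sum_{i=N+1}^{(k+1)N}\theta_h(X_i)^2},$$

$$\mathcal{C}^h = \frac{(1/N)\sum_{i=1}^{N}\theta_h(X_i)^2}{(1/(kN))\sum_{i=N+1}^{(k+1)N}\theta_h(X_i)^2}.$$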
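In the transductive setting the centre $\mathcal{C}^h\alpha_1^h$ is fully observable, because the only second moment it needs is the empirical one on the test design points. The sketch below assumes that $\mathcal{C}^h$ is the ratio of the training to the test empirical second moment, so that $\mathcal{C}^h\alpha_1^h$ is the training cross moment divided by the test second moment; the function and argument names are hypothetical.

```python
def transductive_centres(train_features, train_ys, test_features):
    """Return the lists (C^h * alpha_1^h) and v_h for every feature h.

    train_features[i][h] : theta_h(X_i) for i in the training sample
    test_features[i][h]  : theta_h(X_i) for i in the test sample (labels unseen)
    """
    n = len(train_ys)
    n_test = len(test_features)
    m = len(train_features[0])
    centres, v = [], []
    for h in range(m):
        cross = sum(train_features[i][h] * train_ys[i] for i in range(n)) / n
        v_h = sum(row[h] ** 2 for row in test_features) / n_test
        centres.append(cross / v_h)   # uses only observed quantities
        v.append(v_h)
    return centres, v


# One feature, N = 2 labelled points, kN = 4 unlabelled points.
centres, v = transductive_centres(
    [[1.0], [1.0]], [2.0, 4.0], [[2.0], [2.0], [2.0], [2.0]])
```

No knowledge of the design distribution is needed here, in contrast with the inductive case.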



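3.2. Basic results for k = 1

We first focus on the case where $k = 1$, as a method due to Catoni (6) brings a substantial simplification of the bound in this case.

Theorem 3.1. We have, for any $\varepsilon > 0$, with $P_{2N}$-probability at least $1 - \varepsilon$, for any $h \in \{1, \dots, m\}$:

$$r_2[(\mathcal{C}^h\alpha_1^h)\cdot\theta_h] - r_2(\alpha_2^h\cdot\theta_h) \le 4\,\frac{(1/N)\sum_{i=1}^{2N}\theta_h(X_i)^2Y_i^2}{(1/N)\sum_{i=N+1}^{2N}\theta_h(X_i)^2}\,\frac{\log(2m/\varepsilon)}{N}.$$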
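Remark 3.1. Here again, it is possible to make some hypothesis in order to make the right-hand side of the theorem observable. In particular, if we assume that:

$$\exists B \in \mathbb{R}_+,\quad P(|Y| \le B) = 1,$$

then we can get a looser observable upper bound:

$$P_{2N}\left\{\forall h \in \{1, \dots, m\},\ r_2[(\mathcal{C}^h\alpha_1^h)\cdot\theta_h] - r_2(\alpha_2^h\cdot\theta_h) \le 4B^2\left[1 + \frac{(1/N)\sum_{i=1}^{N}\theta_h(X_i)^2}{(1/N)\sum_{i=N+1}^{2N}\theta_h(X_i)^2}\right]\frac{\log(2m/\varepsilon)}{N}\right\} \ge 1 - \varepsilon.$$

If we do not want to make this assumption, we can use the following variant, that gives a first-order approximation for the bound.

Theorem 3.2. For any $\varepsilon > 0$, with $P_{2N}$-probability at least $1 - \varepsilon$, for any $h \in \{1, \dots, m\}$:

$$r_2[(\mathcal{C}^h\alpha_1^h)\cdot\theta_h] - r_2(\alpha_2^h\cdot\theta_h) \le \frac{8\log(4m/\varepsilon)}{N}\,\frac{\frac{1}{N}\sum_{i=1}^{N}\theta_h(X_i)^2Y_i^2 + \sqrt{\frac{(1/N)\sum_{i=1}^{2N}\theta_h(X_i)^4Y_i^4\,\log(2m/\varepsilon)}{2N}}}{(1/N)\sum_{i=N+1}^{2N}\theta_h(X_i)^2}.$$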



Remark 3.2. Let us assume that $Y$ is such that we know two constants $b_Y$ and $B_Y$ such that:

$$P\exp(b_Y|Y|) \le B_Y < \infty.$$

Then we have, with probability at least $1 - \varepsilon$:

$$\sup_{i \in \{1,\dots,2N\}}|Y_i| \le \frac{1}{b_Y}\log\frac{2NB_Y}{\varepsilon}.$$

Combining both inequalities by a union bound argument leads to:

$$r_2[(\mathcal{C}^h\alpha_1^h)\cdot\theta_h] - r_2(\alpha_2^h\cdot\theta_h) \le \frac{8\log(8m/\varepsilon)}{N}\,\frac{\frac{1}{N}\sum_{i=1}^{N}\theta_h(X_i)^2Y_i^2 + \sqrt{\frac{(1/N)\sum_{i=1}^{2N}\theta_h(X_i)^4\,\log(4m/\varepsilon)\,\log^4(4NB_Y/\varepsilon)}{2Nb_Y^4}}}{(1/N)\sum_{i=N+1}^{2N}\theta_h(X_i)^2}.$$

The proofs of both theorems are given in the proofs section, more precisely in Section 5.2.

Let us compare the first-order term of this theorem to the analogous term in the inductive case (Theorems 2.4 and 2.5). The factor of the variance term is 8 instead of 2 in the inductive case. A factor 2 is to be lost because we have here the variance of a sample of size $2N$ instead of $N$ in the inductive case. But another factor 2 is lost here. Moreover, in the inductive case, we obtained the real variance of $Y\theta_h(X)$ instead of the moment of order 2 here.

In the next subsection, we give several improvements of these bounds, that allow us to recover a real variance, and to recover the factor 2. We also give a version that allows us to deal with a test sample of different size, this being a generalization of Theorem 3.1 more than of its improved variants.

We then give the analog of the algorithm proposed in the inductive case in this transductive setting. 



3.3. Improvements of the bound and general values for k 



The proof of all the theorems of this subsection is given in the next section. 
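3.3.1. Variance term (in the case k = 1)

We introduce some new notations.

Definition 3.3. We write:

$$\forall \theta \in \Theta,\quad r_{1,2}(\theta) = r_1(\theta) + r_2(\theta)$$

and, in the case of a model $h \in \{1, \dots, m\}$:

$$\alpha_{1,2}^h = \arg\min_{\alpha \in \mathbb{R}} r_{1,2}(\alpha\theta_h).$$

Then we have the following theorem.

Theorem 3.3. We have, for any $\varepsilon > 0$, with $P_{2N}$-probability at least $1 - \varepsilon$, for any $h \in \{1, \dots, m\}$:

$$r_2(\mathcal{C}^h\alpha_1^h\theta_h) - r_2(\alpha_2^h\theta_h) \le 4\,\frac{(1/N)\sum_{i=1}^{2N}\left[\theta_h(X_i)Y_i - \alpha_{1,2}^h\theta_h(X_i)^2\right]^2}{(1/N)\sum_{i=N+1}^{2N}\theta_h(X_i)^2}\,\frac{\log(2m/\varepsilon)}{N}.$$

For the proof see Section 5.3.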

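It is moreover possible to modify the upper bound to make it observable. We obtain that with $P_{2N}$-probability at least $1 - \varepsilon$, for any $h \in \{1, \dots, m\}$, the quantity $r_2[(\mathcal{C}^h\alpha_1^h)\theta_h] - r_2(\alpha_2^h\theta_h)$ is bounded by an observable first-order term with leading factor $16\log(4m/\varepsilon)/N$, plus a remainder involving $[\log(m/\varepsilon)/N]^{3/2}$ [the full display is illegible in the extracted source].

So we can see that this theorem is an improvement on Theorem 3.1 when some features $\theta_h(X)$ are well correlated with $Y$. But we lose another factor 2 by making the first-order term of the bound observable.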

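3.3.2. Improvement of the variance term (k = 1)

Theorem 3.4. We have, for any $\varepsilon > 0$, with $P_{2N}$-probability at least $1 - \varepsilon$, for any $h \in \{1, \dots, m\}$:

$$r_2(\mathcal{C}^h\alpha_1^h\theta_h) - r_2(\alpha_2^h\theta_h) \le \frac{1}{1 - 2\log(2m/\varepsilon)/N}\,\frac{2\log(2m/\varepsilon)}{N}\,\frac{V_1(\theta_h) + V_2(\theta_h)}{(1/N)\sum_{i=N+1}^{2N}\theta_h(X_i)^2},$$

where:

$$V_1(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[Y_i\theta(X_i) - \frac{1}{N}\sum_{j=1}^{N}Y_j\theta(X_j)\right]^2,$$

$$V_2(\theta) = \frac{1}{N}\sum_{i=N+1}^{2N}\left[Y_i\theta(X_i) - \frac{1}{N}\sum_{j=N+1}^{2N}Y_j\theta(X_j)\right]^2.$$

It is moreover possible to give an observable upper bound: with $P_{2N}$-probability at least $1 - \varepsilon$, for any $h \in \{1, \dots, m\}$, the same quantity is bounded by an expression with leading factor

$$\frac{1}{1 - 2\log(4m/\varepsilon)/N}\,\frac{4\log(4m/\varepsilon)}{N},$$

in which the unobservable variance is replaced by a training-sample term plus a correction involving the constant $2(2 + \sqrt{2})$, the factor $\sqrt{\log(6m/\varepsilon)/N}$ and $\sqrt{(1/N)\sum_{i=1}^{2N}\theta_h(X_i)^4Y_i^4}$ [the full display is illegible in the extracted source].

Here again, we can make the bound fully observable under an exponential moment or boundedness assumption about $Y$. For a complete proof see Section 5.4.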
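3.3.3. The general case ($k \in \mathbb{N}^*$)

We need some new notations in this case.

Definition 3.4. Let us put:

$$\bar P = \frac{1}{(k+1)N}\sum_{i=1}^{(k+1)N}\delta_{Z_i},$$

and, for any $\theta \in \Theta$:

$$V_\theta = \bar P\left\{\left[\theta(X)Y - \bar P(\theta(X)Y)\right]^2\right\}.$$

Then we have the following theorem.

Theorem 3.5. Let us assume that we have constants $B_h$ and $\beta_h$ such that, for any $h \in \{1, \dots, m\}$:

$$P\exp(\beta_h|\theta_h(X_i)Y_i|) \le B_h.$$

For any $\varepsilon > 0$, with $P_{(k+1)N}$-probability at least $1 - \varepsilon$ we have, for any $h \in \{1, \dots, m\}$, an upper bound on

$$r_2(\mathcal{C}^h\alpha_1^h\theta_h) - r_2(\alpha_2^h\theta_h)$$

whose first-order term carries the factor

$$\left(1 + \frac{1}{k}\right)^2\frac{2V_{\theta_h}\log(4m/\varepsilon)}{N},$$

and whose remainder terms involve $16(\log(4m/\varepsilon))^{3/2}$ and $64(\log(4m/\varepsilon))^2$ multiplied by powers of $\log(4(k+1)mNB_h/\varepsilon)$ [the denominators and exponents of the remainder are illegible in the extracted source].

Here again, it is possible to replace the variance term by its natural estimator:

$$\hat V_{\theta_h} = \frac{1}{N}\sum_{i=1}^{N}\left[Y_i\theta_h(X_i) - \frac{1}{N}\sum_{j=1}^{N}Y_j\theta_h(X_j)\right]^2.$$

For a complete proof of the theorem see the section dedicated to the proofs (more precisely Section 5.5).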
3.4. Application to transductive regression

We give here the interpretation of the preceding theorems in terms of confidence; this motivates an algorithm similar to the one described in the inductive case.

Definition 3.5. We take, for any $(\theta, \theta') \in \Theta^2$:

$$d_2(\theta, \theta') = \sqrt{\frac{1}{kN}\sum_{i=N+1}^{(k+1)N}[\theta(X_i) - \theta'(X_i)]^2}.$$

Let also $\|\theta\|_2 = d_2(\theta, 0)$ and:

$$\langle\theta, \theta'\rangle_2 = \frac{1}{kN}\sum_{i=N+1}^{(k+1)N}\theta(X_i)\theta'(X_i).$$
We define, for any $h \in \{1, \dots, m\}$ and $\varepsilon$:

$$\mathcal{CR}(h, \varepsilon) = \left\{\theta \in \Theta : \left|\left\langle \theta - \mathcal{C}^h\alpha_1^h\theta_h, \frac{\theta_h}{\|\theta_h\|_2}\right\rangle_2\right| \le \sqrt{\beta(\varepsilon, h)}\right\},$$

where $\beta(\varepsilon, h)$ is the upper bound in Theorem 3.1 (or in any other theorem given in the transductive section).

For the same reasons as in the inductive case, these theorems imply the following result.

Corollary 3.6. We have:

$$P_{2N}\left[\forall h \in \{1, \dots, m\},\ \bar\theta_2 \in \mathcal{CR}(h, \varepsilon)\right] \ge 1 - \varepsilon.$$

Definition 3.6. We call $\Pi_2^{h,\varepsilon}$ the orthogonal projection into $\mathcal{CR}(h, \varepsilon)$ with respect to the distance $d_2$.

We propose the following algorithm:

• choose $\theta^{(0)} \in \Theta$ (for example 0);

• at step $n \in \mathbb{N}^*$, we have $\theta^{(0)}, \dots, \theta^{(n-1)}$. Choose $h(n)$, for example:

$$h(n) = \arg\max_{h \in \{1,\dots,m\}} d_2(\theta^{(n-1)}, \mathcal{CR}(h, \varepsilon)),$$

and take:

$$\theta^{(n)} = \Pi_2^{h(n),\varepsilon}\theta^{(n-1)};$$

• we can use the following stopping rule: $\|\theta^{(n-1)} - \theta^{(n)}\|_2^2 \le \kappa$, for a small threshold $\kappa > 0$.

Definition 3.7. We write $n_0$ the stopping step, and:

$$\hat\theta(\cdot) = \theta^{(n_0)}(\cdot)$$

the corresponding function.

Here again we give a detailed version of the algorithm, see Fig. 2. Remark that as in the inductive case, we are allowed to use any heuristic to choose $h(n)$ if we want to avoid the maximization.

Theorem 3.7. We have:

$$P_{2N}\left[\forall n \in \{1, \dots, n_0\},\ r_2(\theta^{(n)}) \le r_2(\theta^{(n-1)}) - d_2^2(\theta^{(n)}, \theta^{(n-1)})\right] \ge 1 - \varepsilon.$$

The proof of this theorem is exactly the same as the proof of Theorem 2.3.

Example 3.1 (Estimation of wavelet coefficients). Let us consider the case where $\Theta_0$ does not depend on the observations. We can, for example, choose a basis of $\Theta$, or a basis of a subspace of $\Theta$. We obtain an estimator of the form:

$$\hat\theta(\cdot) = \sum_{h=1}^{m} c_h\theta_h(\cdot).$$

In the case when $(\theta_h)_h$ is a wavelet basis, we obtain here again a procedure for thresholding wavelet coefficients.
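We have $\varepsilon > 0$, $\kappa > 0$, $N$ observations $(X_1, Y_1), \dots, (X_N, Y_N)$ and also $X_{N+1}, \dots, X_{(k+1)N}$, $m$ features $\theta_1(\cdot), \dots, \theta_m(\cdot)$ and $c = (c_1, \dots, c_m) = (0, \dots, 0) \in \mathbb{R}^m$. First, compute every $\alpha_1^h$ and $\beta(\varepsilon, h)$ for $h \in \{1, \dots, m\}$. Set $n \leftarrow 0$.

Repeat:

• set $n \leftarrow n + 1$;

• set best_improvement $\leftarrow 0$;

• for $h \in \{1, \dots, m\}$, compute:

$$v_h = \frac{1}{kN}\sum_{i=N+1}^{(k+1)N}\theta_h(X_i)^2,$$

$$\gamma_h = \mathcal{C}^h\alpha_1^h - \frac{1}{v_h}\sum_{j=1}^{m} c_j\,\frac{1}{kN}\sum_{i=N+1}^{(k+1)N}\theta_j(X_i)\theta_h(X_i),$$

$$\delta_h = v_h\left(|\gamma_h| - \beta(\varepsilon, h)\right)_+^2,$$

and if $\delta_h >$ best_improvement, set:

best_improvement $\leftarrow \delta_h$,
$h(n) \leftarrow h$;

• if best_improvement $> 0$ set:

$$c_{h(n)} \leftarrow c_{h(n)} + \operatorname{sgn}(\gamma_{h(n)})\left(|\gamma_{h(n)}| - \beta(\varepsilon, h(n))\right)_+;$$

until best_improvement $< \kappa$.

Return the estimation:

$$[\tilde Y_{N+1}, \dots, \tilde Y_{(k+1)N}],$$

where:

$$\tilde Y_i = \sum_{h=1}^{m} c_h\theta_h(X_i).$$

Fig. 2. Detailed version of the feature selection algorithm in the transductive case.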
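The transductive loop of Fig. 2 differs from the inductive one only in that every inner product is an empirical average over the test design points, so all inputs are observable. The sketch below illustrates this under assumptions: the half-widths are taken as given on the coefficient scale, the centres are computed as training cross moments divided by test second moments, and the function name, argument names and the `max_steps` guard are hypothetical additions.

```python
def transductive_selection(train_features, train_ys, test_features, half_widths,
                           kappa=1e-12, max_steps=10000):
    """Greedy projections using only the labelled sample and the test inputs.

    Returns the coefficient vector c and the predictions on the test points.
    """
    n, n_test, m = len(train_ys), len(test_features), len(half_widths)
    # Empirical Gram matrix <theta_j, theta_h>_2 on the test design.
    gram = [[sum(r[j] * r[h] for r in test_features) / n_test
             for h in range(m)] for j in range(m)]
    # Centres C^h * alpha_1^h: training cross moment over test second moment.
    centres = [sum(train_features[i][h] * train_ys[i] for i in range(n)) / n
               / gram[h][h] for h in range(m)]
    c = [0.0] * m
    for _ in range(max_steps):
        best, best_h, move = 0.0, None, 0.0
        for h in range(m):
            inner = sum(c[j] * gram[j][h] for j in range(m))
            gamma = centres[h] - inner / gram[h][h]
            excess = max(abs(gamma) - half_widths[h], 0.0)
            delta = gram[h][h] * excess ** 2
            if delta > best:
                best, best_h = delta, h
                move = excess if gamma >= 0 else -excess
        if best_h is not None:
            c[best_h] += move
        if best < kappa:
            break
    predictions = [sum(c[h] * r[h] for h in range(m)) for r in test_features]
    return c, predictions


coeffs, predictions = transductive_selection(
    [[1.0, 0.0], [0.0, 1.0]], [2.0, 4.0],
    [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In this toy run the two features are orthogonal on the test design and the half-widths are zero, so the procedure simply recovers the two training cross moments rescaled by the test second moments.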

Example 3.2 (SVM and multiscale SVM). Let us choose $\Theta$ as the set of all functions $\mathcal{X} \to \mathbb{R}$, a family of kernels $K_1, \ldots, K_{m'(N)}$ for a $m'(N) \ge 1$ and:

$$\Theta_0 = \bigl\{K_h(X_i,\cdot),\ h \in \{1,\ldots,m'(N)\},\ i \in \{1,\ldots,(k+1)N\}\bigr\}.$$

In this case we have $m = (k+1)N\,m'(N)$. We obtain an estimator of the form:

$$\hat\theta(x) = \sum_{h=1}^{m'(N)} \sum_{j=1}^{2N} \alpha_j^h K_h(X_j,x).$$

Let us put:

$$I_h = \bigl\{j \in \{1,\ldots,2N\},\ \alpha_j^h \ne 0\bigr\}.$$

We have:

$$\hat\theta(x) = \sum_{h=1}^{m'(N)} \sum_{j \in I_h} \alpha_j^h K_h(X_j,x),$$

that is, a Support Vector Machine estimate with different kernels; as in Example 2.3, the kernels $K_h$ can be the same kernel taken at different scales.
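To make the dictionary of Example 3.2 concrete, here is a minimal sketch that builds the features $K_h(X_i,\cdot)$ for one kernel taken at several scales. The Gaussian kernel, the design points and the names `gaussian_kernel` and `kernel_dictionary` are assumptions chosen for illustration; the paper does not fix a particular kernel.

```python
import math


def gaussian_kernel(x, y, scale):
    """Gaussian kernel at a given scale (assumed choice of kernel)."""
    return math.exp(-((x - y) ** 2) / (2.0 * scale ** 2))


def kernel_dictionary(design, scales):
    """Return the features K_h(X_i, .) as callables, one per (scale, point)."""
    features = []
    for scale in scales:
        for x_i in design:
            # Default arguments freeze the current point and scale.
            features.append(lambda x, c=x_i, s=scale: gaussian_kernel(c, x, s))
    return features


# Hypothetical design points (labelled and unlabelled together) and scales.
design = [0.0, 0.4, 1.0, 1.7]
scales = [0.5, 1.0, 2.0]
features = kernel_dictionary(design, scales)
```

The dictionary size is the number of design points times the number of kernels, in line with $m = (k+1)N\,m'(N)$.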

Example 3.3 (Kernel PCA, Kernel Projection Machine). Take the same $\Theta$ and consider the kernel:

$$K(x,x') = \langle \Psi(x), \Psi(x') \rangle.$$

Let us consider a principal component analysis (PCA) of the family:

$$\{K(X_1,\cdot), \ldots, K(X_{(k+1)N},\cdot)\}$$

by performing a diagonalization of the matrix:

$$\bigl(K(X_i,X_j)\bigr)_{1 \le i,j \le (k+1)N}.$$

This method is known as Kernel PCA, see for example (19). We obtain eigenvalues:
and associated eigenvectors $e^1, \ldots, e^{(k+1)N}$, associated to elements of $\Theta$:

(k+l)N (k+l)N 

that are exchangeable functions of the observations. Using the family:

$$\Theta_0 = \{k_1, \ldots, k_{(k+1)N}\},$$

we obtain an algorithm that selects which eigenvectors are going to be used in the regression estimation. This 
is very close to the Kernel Projection Machine (KPM) described by Blanchard, Massart, Vert and Zwald (4) 
in the context of classification. 
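The diagonalization step of Example 3.3 can be sketched as follows. This is a sketch under assumptions, not the paper's construction: it uses a Gaussian kernel, computes only the leading eigenpair by power iteration (to stay within the standard library), and assumes that the function attached to an eigenvector $e$ is $x \mapsto \sum_j e_j K(X_j, x)$. The names `gram_matrix`, `leading_eigenpair` and `eigen_feature` are hypothetical.

```python
import math


def gram_matrix(design, scale=1.0):
    """Matrix (K(X_i, X_j)) for an assumed Gaussian kernel."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * scale ** 2)) for b in design]
            for a in design]


def leading_eigenpair(matrix, n_iter=500):
    """Power iteration: largest eigenvalue and a unit eigenvector."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))   # norm of K v
        v = [x / lam for x in w]
    return lam, v


def eigen_feature(design, vector, scale=1.0):
    """Assumed form of the feature attached to an eigenvector."""
    def feature(x):
        return sum(e_j * math.exp(-((x_j - x) ** 2) / (2.0 * scale ** 2))
                   for e_j, x_j in zip(vector, design))
    return feature


design = [0.0, 0.5, 1.0, 2.0, 3.5]   # hypothetical design points
K = gram_matrix(design)
lam, e1 = leading_eigenpair(K)
k1 = eigen_feature(design, e1)
```

Because the Gram matrix depends on the design points only as a set, the resulting feature is an exchangeable function of the observations, which is the property the example relies on. A full Kernel PCA would compute all eigenpairs with a linear algebra library.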

4. Rates of convergence in Sobolev and Besov spaces 

We conclude this paper by coming back to the inductive case. We use Theorem 2.3 as an oracle inequality to show that the obtained estimator is adaptive, which means that if we assume that the true regression function $f$ has an unknown regularity $\beta$, then the estimator is able to reach the optimal speed of convergence $N^{-2\beta/(2\beta+1)}$ up to a $\log N$ factor.

4.1. Presentation of the context

Here we assume that $\mathcal{X}$ is a compact interval of $\mathbb{R}$, that $\Theta = L^2(P_{(X)})$ and that $P$ is such that $Y = f(X) + \eta$ with $\eta$ independent of $X$, $P\eta = 0$ and $P(\eta^2) \le \sigma^2 < +\infty$.

We assume that $(\theta_k)_{k \in \mathbb{N}^*}$ is an orthonormal basis of $\Theta$. We still have to choose $m \in \mathbb{N}$ and we will take $\Theta_0 = \{\theta_1, \ldots, \theta_m\}$.

Remark that the orthogonality means here that $P[\theta_k(X)^2] = 1$ for any $k \in \mathbb{N}^*$, and that:

$$P[\theta_k(X)\theta_{k'}(X)] = 0$$

for any $k' \ne k$.
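The orthonormality condition can be checked numerically on a concrete basis. The sketch below assumes $X$ uniform on $[0,1]$ and the trigonometric basis (one possible choice, used later in the remark following Theorem 4.1); the indexing convention in `trig_basis` and the grid size are assumptions made for the illustration.

```python
import math


def trig_basis(k, x):
    """k-th function (k >= 1) of the trigonometric basis on [0, 1]."""
    if k == 1:
        return 1.0
    j = k // 2
    if k % 2 == 0:
        return math.sqrt(2.0) * math.cos(2.0 * math.pi * j * x)
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * j * x)


def inner_product(k, k_prime, n_grid=2000):
    """Approximate P[theta_k(X) theta_k'(X)], X uniform (midpoint rule)."""
    total = 0.0
    for i in range(n_grid):
        x = (i + 0.5) / n_grid
        total += trig_basis(k, x) * trig_basis(k_prime, x)
    return total / n_grid


# Gram matrix of the first five basis functions.
gram = [[inner_product(k, kp) for kp in range(1, 6)] for k in range(1, 6)]
```

The resulting matrix is numerically the identity: ones on the diagonal for $P[\theta_k(X)^2] = 1$ and zeros elsewhere for the orthogonality of distinct functions.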

4.2. Rate of convergence of the estimator: the Sobolev space case

Now, let us put:

$$\bar\theta_m = \arg\min_{\theta \in \mathrm{Span}(\Theta_0)} R(\theta)$$

(that depends effectively on $m$ by $\Theta_0 = \{\theta_1, \ldots, \theta_m\}$), and let us assume that $f$ satisfies the two following conditions: it is regular, namely there is an unknown $\beta \ge 1$ and a $C \ge 0$ such that:

$$\|\bar\theta_m - f\|_P^2 \le C m^{-2\beta},$$

and that we have a constant $B < \infty$ such that:

$$\sup_{x \in \mathcal{X}} |f(x)| \le B$$

with $B$ known to the statistician. It follows that:

$$\|f\|_P^2 \le B^2.$$

It follows that every set, for $k \in \{1,\ldots,m\}$:

$$\mathcal{F}_k = \Bigl\{\sum_{j=1}^{\infty} \alpha_j\theta_j \in \Theta:\ \alpha_k^2 \le B^2\Bigr\}$$

is a convex set that contains $f$ and such that the orthogonal projection $\Pi_P^{\mathcal{F},m} = \Pi_P^{\mathcal{F}_m} \cdots \Pi_P^{\mathcal{F}_1}$ (where $\Pi_P^{\mathcal{F}_k}$ denotes the orthogonal projection on $\mathcal{F}_k$) can only improve an estimator:

$$\forall \theta,\quad \|\Pi_P^{\mathcal{F},m}\theta - f\|_P^2 \le \|\theta - f\|_P^2.$$

Actually, note that this projection just consists in thresholding very large coefficients to a limited value. This modification is necessary in what follows, but this is just a technical remark: most of the time, our estimator won't be modified by $\Pi_P^{\mathcal{F},m}$ for any $m$.
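In coefficient space, and assuming an orthonormal basis as in this section, the projection described above can be read as clipping each of the first $m$ coefficients to $[-B, B]$. The sketch below illustrates this reading and checks on a toy example that the squared distance to a target with coefficients bounded by $B$ does not increase; the names and numbers are hypothetical.

```python
def project_on_boxes(alpha, B):
    """Clip every coefficient to [-B, B] (assumed coefficient-space reading)."""
    return [max(-B, min(B, a)) for a in alpha]


def squared_distance(alpha, target):
    """Squared distance between two coefficient sequences."""
    return sum((a - t) ** 2 for a, t in zip(alpha, target))


B = 2.0
f_coeffs = [1.5, -0.3, -1.0]        # hypothetical target, all |.| <= B
estimate = [5.0, -0.3, -7.5]        # hypothetical estimate, two huge values
projected = project_on_boxes(estimate, B)
before = squared_distance(estimate, f_coeffs)
after = squared_distance(projected, f_coeffs)
```

Coefficients that already lie in $[-B, B]$ are left unchanged, which matches the remark that the estimator is rarely modified by the projection.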

Remember also that in this context, the estimator given in Definition 2.6 is just: 

Theorem 4.1. Let us assume that $\Theta = L^2(P_{(X)})$, $\mathcal{X} = [0,1]$ and $(\theta_k)_{k \in \mathbb{N}^*}$ is an orthonormal basis of $\Theta$. Let us assume that we are in the idealized regression model:

$$Y = f(X) + \eta,$$

where $P\eta = 0$, $P(\eta^2) \le \sigma^2 < \infty$ and $\eta$ and $X$ are independent, and $\sigma$ is known. Let us assume that $f \in \Theta$ is such that there is an unknown $\beta \ge 1$ and an unknown $C \ge 0$ such that:

$$\|\bar\theta_m - f\|_P^2 \le C m^{-2\beta},$$

and that we have a constant $B < \infty$ such that:

$$\sup_{x \in \mathcal{X}} |f(x)| \le B$$

with $B$ known to the statistician. Then our estimator $\hat\theta$ (given in Definition 2.6 with $n_0 = m$ here, built using the bound $\beta(\varepsilon,k)$ given in Theorem 2.1), with $\varepsilon = N^{-2}$ and $m = N$, is such that, for any $N \ge 2$,

$$P^{\otimes N}\bigl[\|\Pi_P^{\mathcal{F},N}\hat\theta - f\|_P^2\bigr] \le C'(C,B,\sigma)\Bigl(\frac{\log N}{N}\Bigr)^{2\beta/(2\beta+1)}.$$

Here again, the proof is given at the end of the paper (Section 5.7). Let us just remark that, in the case where $\mathcal{X} = [0,1]$, $P_{(X)}$ is the Lebesgue measure, and $(\theta_k)_{k \in \mathbb{N}^*}$ is the trigonometric basis, the condition:

$$\|\bar\theta_m - f\|_P^2 \le C m^{-2\beta}$$

is satisfied for $C = C(\beta,L)$ as soon as $f \in W(\beta,L)$ where $W(\beta,L)$ is the Sobolev class:

$$\Bigl\{f \in L^2:\ f^{(\beta-1)} \text{ is absolutely continuous and } \int_0^1 f^{(\beta)}(x)^2\,\lambda(dx) \le L^2\Bigr\}.$$

The minimax rate of convergence in $W(\beta,L)$ is $N^{-2\beta/(2\beta+1)}$, so we can see that our estimator reaches the best rate of convergence up to a $\log N$ factor with an unknown $\beta$.

4.3. Rate of convergence in Besov spaces

We here extend the previous result to the case of a Besov space $B_{s,p,q}$ in the case of a wavelet basis (see (11) or (12)).

Theorem 4.2. Let us assume that $\mathcal{X} = [-A,A]$, that $P_{(X)}$ is uniform on $\mathcal{X}$ and that $(\psi_{j,k})_{j=0,\ldots,+\infty,\ k \in \{1,\ldots,2^j\}}$ is a wavelet basis, together with a function $\phi$, satisfying the conditions given in (11), with $\phi$ and $\psi_{0,1}$ supported by $[-A,A]$. Let us assume that $f \in B_{s,p,q}$ with $s > \frac{1}{p}$, $1 \le p, q \le \infty$, with:

$$B_{s,p,q} = \Bigl\{g: [-A,A] \to \mathbb{R},\ g(\cdot) = \alpha\phi(\cdot) + \sum_{j=0}^{\infty}\sum_{k=1}^{2^j} \beta_{j,k}\psi_{j,k}(\cdot),\ \sum_{j=0}^{\infty} 2^{jq(s+\frac12-\frac1p)} \Bigl[\sum_{k=1}^{2^j} |\beta_{j,k}|^p\Bigr]^{q/p} = \|g\|_{s,p,q}^q < +\infty\Bigr\}$$

(with obvious changes for $p = +\infty$ or $q = +\infty$) with unknown constants $s$, $p$ and $q$ and that for any $x$, $|f(x)| \le B$ for a known constant $B$. Let us choose:

{0i,...,M = WU{V',,fe,J = l,...,2L'°s^/'°s2J,fc = l,...,2^"} 
(so $\frac{N}{2} \le m \le N$) and $\varepsilon = N^{-2}$ in the definition of $\hat\theta$. Then we have:

$$P^{\otimes N}\bigl[\|\Pi_P^{\mathcal{F},m}\hat\theta - f\|_P^2\bigr] = O\Bigl(\Bigl(\frac{\log N}{N}\Bigr)^{2s/(2s+1)} (\log N)^{(1 - 2/((1+2s)q))_+}\Bigr).$$

Let us remark that we obtain nearly the same rate of convergence as in (11), namely the minimax rate of convergence up to a log factor.
For the proof, see Section 5.7.
5. Proofs

The proofs appear in the same order as the results in the paper, except for the first theorem (Theorem 2.1): since its proof uses lemmas proved in the transductive setting, it is given after the proofs of the transductive theorems.

5.1. Proof of Theorems 2.4-2.6

First, we prove a lemma that is the basis of the proofs of Theorems 2.4-2.6.

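Lemma 5.1. We have, for any $\theta \in \Theta$, $\gamma \ge 0$ and $\eta \ge 0$:

$$P\exp(\gamma W_\theta - \eta) = \exp\Bigl\{\frac{\gamma^2}{2}V(W_\theta) + \frac{\gamma^3}{2}\int_0^1 (1-\beta)^2 M^3_{\beta\gamma W_\theta}(W_\theta)\,d\beta - \eta\Bigr\}$$

and

$$P\exp(-\gamma W_\theta - \eta) = \exp\Bigl\{\frac{\gamma^2}{2}V(W_\theta) - \frac{\gamma^3}{2}\int_0^1 (1-\beta)^2 M^3_{-\beta\gamma W_\theta}(W_\theta)\,d\beta - \eta\Bigr\}.$$

Proof. For the first equality, we write:

$$\begin{aligned}
\log P\exp(\gamma W_\theta - \eta) &= \log P\exp(\gamma W_\theta) - \eta \\
&= \int_0^\gamma P_{\beta W_\theta}(W_\theta)\,d\beta - \eta = \int_0^\gamma (\gamma - \beta)V_{\beta W_\theta}(W_\theta)\,d\beta - \eta \\
&= \frac{\gamma^2}{2}V(W_\theta) + \int_0^\gamma \frac{(\gamma-\beta)^2}{2} M^3_{\beta W_\theta}(W_\theta)\,d\beta - \eta \\
&= \frac{\gamma^2}{2}V(W_\theta) + \frac{\gamma^3}{2}\int_0^1 (1-\beta)^2 M^3_{\beta\gamma W_\theta}(W_\theta)\,d\beta - \eta.
\end{aligned}$$

For the reverse equality, the proof is exactly the same, replacing $\gamma$ by $-\gamma$. □

We can now give the proof of both theorems.

Proof of Theorem 2.4. Let us choose $k \in \{1,\ldots,m\}$; for any $\lambda_k > 0$ and $\eta_k \ge 0$ we have: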

P^^expj^ - P{Y9,{X))] - 7y,| 



= < Pexp 



^Wg 

N N 



N 



- exp 



3. 

2N 



^(^0, ) + ^ { V - /3)'Aff^A J7V)W«, iWgJd(3 - V, 



by the first equality of Lemma 5.1. In the same way, using the reverse equality, we obtain:
P«^exp|^fj[P(r0fe(X)) - YMX^)] - %| 

= exp [§^V{We,) /3)^Mf^,^/^)^^^ {Wg,)dp 
So we obtain, for any $k \in \{1,\ldots,m\}$, for any $\lambda_k > 0$ and $\eta_k \ge 0$:



< 2cxp 



1 ^ 



Vk 



cosh 



< 2 exp 



since, for any $x \in \mathbb{R}$, we have:

$$\cosh(x) \le \exp\Bigl(\frac{x^2}{2}\Bigr).$$

Now, let us choose $\varepsilon > 0$ and put:



Vk 



3. 

2N 



+ / (i-/5rM(W/^)w.,(w^ejd/3i -fog 



2m' 



We obtain: 



771 f 1 ^ 

P^^J^exphfe -J2y^0k{x,)-p{Yek{xj) 

k=l I 1=1 



2m 



< e 



and so: 



Vfc G {1, . . • ,m}, 



1 ^ 



\og{2m/e) 



> 1 



Now, we put: 



'27Vlog(2m/e) 



We obtain, with $P^{\otimes N}$-probability at least $1 - \varepsilon$, for any $k \in \{1,\ldots,m\}$:

N 



^Y^YMX^) - P{Y9,{X)) 



< 



2V{Ws,)\og{2m/e) 
N 

log^/'(2m/£) / 



For short, we take the notation of the theorem: 
Now, dividing both sides by:

we obtain: 

\akCk-ak\< 



2ViWe,)log{2m/e) , lliXk/N)log'/\2m./s) 



N 



NViWe.r 



□ 



In order to conclude, just remark that: 

RiakCkek) - R{ak9k) = \akCk -ak\^P[9k{X)% 
Proof of Theorem 2.5. Remark that, for any $\theta \in \Theta$:

$$V(W_\theta) = P(W_\theta^2) - P(W_\theta)^2;$$

we will deal with each term separately. For the first term, the following result is obtained exactly as Lemma 5.1. For any $\theta \in \Theta$:

Let us apply this result to every $\theta_k$ for $k \in \{1,\ldots,m\}$:



P^^cxp A, 



1 ^ 



4=1 



Ilk 



where: 
Taking 



A. 



log- 



2m 



and 



l 2N\og{2m/£) 



we obtain that the following inequality is satisfied with $P^{\otimes N}$-probability at least $1 - \frac{\varepsilon}{2}$, for any $k$:

N I . — . / 

P{Wl)<-Y.Y^9,{X,r 



l2V{Wl)\og{2m/e) log{2m/e) ^ ( /21og(2m/e) 



1=1 

N 



N 



(5.1) 
for short. Now, we try to upper bound the second term, $-P(W_\theta)^2$. Remark that, for any $\theta$:

1 ^ 

^ J:^Y.^^O{Xi)-P{We) 

i=l 

f 1 ^ 1 ^ 1 

{ i=l i=l ) 

Remember that in the proof of Theorem 2.4 we got the upper bound, with probability at least $1 - \frac{\varepsilon}{2}$, for any $k$:



N 

i=l 

that gives: 



1 ^ 

-J2YMX^) - P{Y9,iX)) 



^ ^ , 2F(W^eJlog(4m/£) ^ log^/^(4m/£) ^^ f /21og(4m/£) ^ ' 



TV 



NViWe.r 



piwe,r<-(^^f2Y,0kix,)^ +1 

r 1 w 

I 1=1 



21/(W^e J log(4m/£) log^/2(4m/£) ^ / /21og(477i/£) 



N 



NV{WeJ 



3 "fc 



2F(W^9j log(4m/£) log^/2(4m/£) ^ / /21og(4m/£) 



N 



NV{W9, 



\3 "I' 



NV{WeJ 



(5.2) 



for short. Let us combine inequalities (5.1) and (5.2). We obtain that, with probability at least $1 - \varepsilon$, for every $k$ we have:



TV / N \ 2 

V{We, ) = P{Wl ) - PiWe, )' < - J] K,^^?^ (X,)^ - - ^ K,^^, (X,) + ^a. + Bk = % + A + 



□ 



Proof of Theorem 2.6. This proof is a variant of the proof of Theorem 2.4; the method it uses is due to Seeger (20). Let us define, for any $i \in \{1,\ldots,N\}$:

$$P_i(\cdot) = P^{\otimes N}(\cdot \mid Z_i).$$

Let us choose $(i,k) \in \{1,\ldots,N\} \times \{1,\ldots,m'(N)\}$; for any $\lambda_{i,k} = \lambda_{i,k}(Z_i) > 0$ and $\eta_{i,k} = \eta_{i,k}(Z_i) \ge 0$ we have:



^.cxp|^^^K0,,,.(X,) - P{Y6.,^k{X))] - 



< exp 



X3 



2{N ^ 



by the first equality of Lemma 5.1. In the same way, we obtain the reverse inequality and, combining both results, for any $(i,k) \in \{1,\ldots,N\} \times \{1,\ldots,m'(N)\}$, for any $\lambda_{i,k} > 0$ and $\eta_{i,k} \ge 0$:



P^ expj A,,fc ^o^'^AX,) - P{Y6,,k{X)) 



< 2 cxp 

< 2 exp 



A? 
2(N-1)



cosh 



2(7V-1)2 

j2 



where: 



for short. Now, let us choose e > and put: 



We obtain:



>2 



2{N -1) 



8(iV- 1) 



jllk - log 



2Nm'{N)' 



N 7n'{N} 



i=i k'=i ^ 



\,k 



AT m'(Af) , 

^p«iv^ E ^.exp A,,, ^^E^^^^'=(^^)-^(^^'.'=W) 
i=i fc'=i 



X2 



2{N -I) 



X6 



8(Ar- 1)4"^-'=^^°^ 2Nm'{N) 



< £. 



Now, we put: 



A,- i. 



/2iVlog(2iVm'(iV)/e) 



and complete the proof exactly as for Theorem 2.4. □

5.2. Proof of Theorems 3.1 and 3.2

Here again, the first thing to do is to prove a general deviation inequality. This one is a variant of the one given by Catoni (6). We go back to the notations of Theorems 3.1 and 3.2, with a test sample of size $N$.

Definition 5.1. Let $\mathcal{G}$ denote the set of all functions:

$$g: (\mathcal{X} \times \mathbb{R})^{2N} \times \mathbb{R}^2 \to \mathbb{R},$$

$$(Z_1, \ldots, Z_{2N}, u, u') \mapsto g(Z_1, \ldots, Z_{2N}, u, u') = g(u,u')$$

for the sake of simplicity, such that $g$ is exchangeable with respect to its $2N$ first arguments.

Lemma 5.2. For any exchangeable probability distribution $\mathcal{P}$ on $(Z_1, \ldots, Z_{2N})$, for any measurable function $\eta: (\mathcal{X} \times \mathbb{R})^{2N} \to \mathbb{R}$ that is exchangeable with respect to its $2 \times 2N$ arguments, for any measurable function $\lambda: (\mathcal{X} \times \mathbb{R})^{2N} \to \mathbb{R}_+^*$ that is exchangeable with respect to its $2 \times 2N$ arguments, for any $\theta \in \Theta$ and any $g \in \mathcal{G}$:

\ i=l i=l / 

and the reverse inequality: 

( \ ^ X2 SAT \ 



where we write: 

X = X{{Xi,Yi),...,{X2n,Y2n)) 



for short, and: 
Proof. In order to prove the first inequality, we write: 



2 if g is nonnegative, 
otherwise. 



^""Mn ll^a[e{x^+N). Y+n] - g[e{x,). y,]} - ^ y,]' - 77 

\ 1=1 1=1 / 

= 7'exp( ^logcosll|-g[0(X,+^.),y,+^.] - -g[e{X.,),Y,]^ _ - • 

This last step is true because $\mathcal{P}$ is exchangeable. We conclude by using the inequality:

$$\forall x \in \mathbb{R},\quad \log\cosh x \le \frac{x^2}{2}.$$

We obtain:

logcoshj - Ag[0(x,),y,]| < ^{g[e{x,^^),Y,+N]^g[e{X..),Y,]f 

<^2mx.iY.f. 

The proof for the reverse inequality is exactly the same. □ 
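The elementary inequality $\log\cosh x \le x^2/2$ used in this proof is easy to check numerically. The sketch below evaluates the gap on a grid; the grid and the overflow-safe formula for $\log\cosh$ are choices made for this illustration only.

```python
import math


def log_cosh(x):
    """log(cosh(x)) computed without overflow for large |x|."""
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


# Largest value of log cosh(x) - x^2 / 2 on a grid of [-10, 10].
grid = [i / 100.0 for i in range(-1000, 1001)]
max_gap = max(log_cosh(x) - x * x / 2.0 for x in grid)
```

The gap is never positive (up to rounding), and it vanishes only at $x = 0$, where both sides are zero.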

We can now give the proof of the theorems. 

Proof of Theorem 3.1. From now on we assume that the hypotheses of Theorem 3.1 are satisfied. Let us choose $\varepsilon' > 0$ and apply Lemma 5.2 with $\eta = -\log\varepsilon'$, and $g$ such that $g(u,u') = uu'$. We obtain: for any exchangeable distribution $\mathcal{P}$, for any measurable function $\lambda: (\mathcal{X} \times \mathbb{R})^{2N} \to \mathbb{R}_+^*$ that is exchangeable with respect to its $2 \times 2N$ arguments, for any $\theta \in \Theta$:

( X N 2 2W \ 

^'^^P -J^Y}^{X^+N)Y.+N - e{X,)Y,] ^ —Y^e{X,fY^ + \oge' <e' 

\ i=l i=l / 
and the reverse inequality:



N 



Let us denote: 



/(0,e',A) = A 



N 2 2Ar 



loge' 



2N 



(5.3) 



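The previous inequalities imply that: for any exchangeable $\mathcal{P}$, for any measurable function $\lambda: (\mathcal{X} \times \mathbb{R})^{2N} \to \mathbb{R}_+^*$ that is exchangeable with respect to its $2 \times 2N$ arguments, for any $\theta \in \Theta$:

$$\mathcal{P}\exp f\bigl((Z_1,\ldots,Z_{2N}),\theta,\varepsilon',\lambda\bigr) \le 2\varepsilon'.$$

Now, let us introduce a new conditional probability measure:

$$\bar{P} = \frac{1}{(2N)!}\sum_{\sigma \in \mathfrak{S}_{2N}} \delta_{(X_{\sigma(i)},Y_{\sigma(i)})_{i \in \{1,\ldots,2N\}}}.$$

Remark that $P_{2N}$ being exchangeable, we have, for any bounded function $h: (\mathcal{X} \times \mathbb{R})^{2N} \to \mathbb{R}$,

$$P_{2N}h = P_{2N}(\bar{P}h).$$

The measure $\bar{P}$ is exchangeable, so we can apply Eq. (5.3). For any values of $Z_1, \ldots, Z_{2N}$ we have:

$$\forall \theta \in \Theta,\quad \bar{P}\exp f\bigl((Z_1,\ldots,Z_{2N}),\theta,\varepsilon',\lambda\bigr) \le 2\varepsilon'.$$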

In particular, we can choose 9 = 9{Zi, . . . , Z2n) as an exchangeable function of {Zi, . . . , Z2n), because we 
will have: 

— — ^ CXpf{{Z„i^i-f,...,Z„i^2N)),&{Za{l),---,Zcr(2N)),£'A) 

^ '' a-ee2N 

= 72^! E cxp/((Z^(i),...,Z^(2jv)),^'(2'i,...,^2w),e',A)<e'. 

Here, we choose as functions 9 the members of 6*0: ^^i, • • • ,^rn (remember that we choose this indexation 
in such a way that for any fc, 9k is an exchangeable function of (Zi, . . . , Z2n))- We have, for any Ai, . . . , Am 
that are m exchangeable functions of (Zi, . . . , Z2n)'- 

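$$\begin{aligned}
P_{2N}\bigl[\exists k \in \{1,\ldots,m\},\ f((Z_1,\ldots,Z_{2N}),\theta_k,\varepsilon',\lambda_k) > 0\bigr]
&= P_{2N}\Bigl[\bigcup_{k=1}^{m}\bigl\{f((Z_1,\ldots,Z_{2N}),\theta_k,\varepsilon',\lambda_k) > 0\bigr\}\Bigr] \\
&\le P_{2N}\Bigl[\sum_{k=1}^{m} \mathbb{1}\bigl(f((Z_1,\ldots,Z_{2N}),\theta_k,\varepsilon',\lambda_k) > 0\bigr)\Bigr] \\
&= P_{2N}\bar{P}\Bigl[\sum_{k=1}^{m} \mathbb{1}\bigl(f((Z_1,\ldots,Z_{2N}),\theta_k,\varepsilon',\lambda_k) > 0\bigr)\Bigr] \\
&= P_{2N}\sum_{k=1}^{m} \bar{P}\bigl[\mathbb{1}\bigl(f((Z_1,\ldots,Z_{2N}),\theta_k,\varepsilon',\lambda_k) > 0\bigr)\bigr] \\
&\le P_{2N}\sum_{k=1}^{m} \bar{P}\exp f\bigl((Z_1,\ldots,Z_{2N}),\theta_k,\varepsilon',\lambda_k\bigr).
\end{aligned}$$

Now let us apply inequality (5.3); we obtain:

$$P_{2N}\bigl[\exists k \in \{1,\ldots,m\},\ f((Z_1,\ldots,Z_{2N}),\theta_k,\varepsilon',\lambda_k) > 0\bigr] \le P_{2N}\sum_{k=1}^{m} 2\varepsilon' = 2\varepsilon' m = \varepsilon$$

if we choose:

$$\varepsilon' = \frac{\varepsilon}{2m}.$$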

From now, we assume that the event: 

jvfc e {!,..., m},/(^(Zi,...,Z2w),0fc,^, Afe) <o| 

is satisfied. It can be written, for any fc G {1, . . . , m}: 
1 ^ 

1=1 

Let us divide both inequalities by: 

2N 

4 E ^^i^^r- 



i=N+l 



We obtain, for any fee {1, . . . , m}: 

It is now time to choose the functions . We try to optimize the right-hand side with respect to Afc , and 



obtain a minimal value for 
Afe 



iVlog(2m/£) 



This choice is admissible because it is exchangeable with respect to (Zi, . . . , Z2n)- 
So we have, for any S {1, . . . , m}: 



,.fe fe fe, < Ji^imT:i^^mX,?Y^]Xo^{2m/e) 

Finally, remark that: 

/ r2[(Cfc4)gfc]-r2(a§gfe) 

which leads to the conclusion that for any /s G {!,..., m}: 

r2[[C ai)0fej-r2(a20fe)<2 2iv V7V^2 

(1/^) l^i=N+i^k[x,y 
This ends the proof. □
Proof of Theorem 3.2. We write: 

2N ^ N ^ 2N 



2 



i=l 4=1 i=N+l 

and try to upper bound the second term. We apply Lemma 5.2, but this time with $g$ such that $g(u,u') = (uu')^2$, which is nonnegative, and obtain, for any $\varepsilon$, for any (exchangeable) $\theta$ and $\lambda$:

^ E ^^(^o^^.^4E^^(^^)^>^.^ + 4(^)E^^-(^o^^^^ 

We choose: 



2iVlog£ 



we apply this result to every $\theta \in \Theta_0$, and combine it with Theorem 3.1 by a union bound argument to obtain the result. □

5.3. Proof of Theorem 3.3 

First of all, we give the following obvious variant of Lemma 5.2: 

Lemma 5.3. For any exchangeable probability distribution $\mathcal{P}$ on $(Z_1, \ldots, Z_{2N})$, for any measurable function $\eta: (\mathcal{X} \times \mathbb{R})^{2N} \to \mathbb{R}$ that is exchangeable with respect to its $2 \times 2N$ arguments, for any measurable function $\lambda: (\mathcal{X} \times \mathbb{R})^{2N} \to \mathbb{R}_+^*$ that is exchangeable with respect to its $2 \times 2N$ arguments, for any $\theta \in \Theta$:



Pexpj - Y,{[HX^+N)Y.+N - a{e)0{X,+Nf] - [9{X,)Y, - a{9)9{X,f]} 



i=l I 

and the reverse inequality, where:

$$\alpha(\theta) = \arg\min_{\alpha \in \mathbb{R}} r_{1,2}(\alpha\theta).$$

Proof. This is actually just an application of Lemma 5.2; we just need to remark that $\alpha(\theta)$ is an exchangeable function of $(Z_1, \ldots, Z_{2N})$, and so we can take in Lemma 5.2:

$$g(u,u') = uu' - u^2\alpha(\theta),$$

which means that:

$$g[\theta(X_i),Y_i] = \theta(X_i)Y_i - \alpha(\theta)\theta(X_i)^2. \qquad \square$$

Proof of Theorem 3.3. Proceeding exactly in the same way as in the proof of Theorem 3.1, we obtain 
the following inequality with probability at least 1 — e: 



r2{C''a\9k)-r2{a'^9k)<^ 



{l/N)Y.Zi[(^k{X,)Y, - al^OkiX, 
This proves the theorem.



□ 



Before giving the proof of the next theorem, let us see how we can make the first-order term observable 
in this theorem. For example, we can write: 



+ 2[ek{Xi)Y, ~ alOkiX.fWa'l - al^]ek{Xif . 

Remark that it is obvious that: 

l«?-42l<l«?-«2l, 

and so: 

[ek{X,)Y, - al^ek{X,ff < [9k{X,)Y, - a%{X,ff + [aj - a!^f9k{X,)^ 

+ 2\0k{X,)Y - aj0fc(X,)'||a^ - a^MX^f■ 

Now, just write: 

ai — a2 = [i^ — L )ai — (C ai ^ 
and so we get: 

[ekiX,)Y, - al^OkiX,)^? < [Ok{X,)Y, - a^OkiX^ff + [C'^a'l - a^2?dk{X,f 

+ 2|C'=aJ - a^lKl - C'')a1\ek{X,y + (1 - C^)'(a^)X-(X^)" 
+ 2\9kiX,)Y, - a'iek{X,)^\\C^a'l - al\ek{X,)^ 
+ 2\0k{Xi)Y, - alekiXifWiC" - l)a1\9k{X,)^. 

So finally, Eq. (5.4) left us with a second degree inequality with respect to jC'^aJ;' — | or r2{C''a'i9k) — 
''2(a§0fc) that we can solve to obtain the following result: with probability at least 1 — e, as soon as we have: 



2i2 



2JV "I ^ r 27V 



41og(2m/e) 
N ' 



which is always true for large enough N, the quantity |C — ! belongs to the interval: 



2 log(2m/e) b±^b^+ a((iV/log(2m/e))[(l/iV) ^ -fAr+i OkiX,)'? ~ (4/iV) E -fi Ok{X.V) 



^2N 



^ [a/N)j:'^^^^,9kiX.W ~ {4log{2m/s)/N)[{l/N)j:t:,9kiX.r^ 

with the following notations: 



^2N 



^ 2N 

- Y,[\9k{X,)Y - al9k{X,r\ + 14(1 - C')\9kiX.r]\ 



i=i 

2N 



b=^J2 2^fe(^0'[l«fe(l - C'MiX,)^ + \9k{X,)Y, ~ a'^9k{X,)% 



N 

1=1 

Remark that only one of the bounds of the interval is positive. So we obtain the following result: with 
P2Ar-probability at least 1 — e, as soon as: 



^ 2JV 1 ^ r 27V 



=Ar-|-l 



i=l 



41og(2m/e) 
N 
we have:



Vfce{l,...,TO}, 



< 



4l0g^(2TO/£ 



^ 27V 



b+ + aiiN/logi2m/s)m/N)j:^^^^,e,{X,r]^ - (4/jV) ^fe(^O^) 



^2N 



We can notice that this bound may be written:



kn ^ / 8alog(2m/£) 



N 

!log(2?«/e) 
TV 







2N 



log(m/e) 



N 



3/2 





log(m/e) 






TV 





The next step would be now to replace the bound by an observable quantity, by getting a bound like: 

2N ^ N 

N ■ 



2N N 

- ^(0fe(xor. - alOkix^f f < -Y.^ek{Xi)Y, - a'iek{x,ff + o 



log(m/£) 



TV 



=1 1=1 
with high probability. This can be done very simply, using Lemma 5.2 with this time 

g{u, u') = (uu' — v?'a{9))^ . 

We obtain the bound: 



161og(4m/£) 
TV 



1 ^ 

-Y^{eu{Xi)Y,- a\ek{Xifr 





\og{m/e) 






TV 





r2[{C''a\)ek]-r2{atek)< 
5.4. Proof of Theorem 3.4

The proof is exactly similar; we just use a new variant of Lemma 5.2, which is based on an idea introduced by Catoni (8) in the context of classification.

Definition 5.2. Let us write:

$$T_\theta(Z_i) = \theta(X_i)Y_i$$

for short. We also introduce a conditional probability measure:

$$P^{(2)} = \frac{1}{N!}\sum_{\sigma \in \mathfrak{S}_N} \delta_{(Z_1,\ldots,Z_N,Z_{N+\sigma(1)},\ldots,Z_{N+\sigma(N)})}.$$

Remark that, because $\mathcal{P}$ is exchangeable, we have, for any function $h$:

$$\mathcal{P}h = \mathcal{P}[P^{(2)}h].$$
Lemma 5.4. For any exchangeable probability distribution $\mathcal{P}$ on $(Z_1, \ldots, Z_{2N})$, for any measurable function $\eta: (\mathcal{X} \times \mathbb{R})^{2N} \to \mathbb{R}$ that is exchangeable with respect to its $2 \times 2N$ arguments, for any measurable function $\lambda: (\mathcal{X} \times \mathbb{R})^{2N} \to \mathbb{R}_+^*$ which is such that, for any $i \in \{1,\ldots,2N\}$:

$$\lambda(Z_1,\ldots,Z_{2N}) = \lambda(Z_1,\ldots,Z_{i-1},Z_{i+N},Z_{i+1},\ldots,Z_{i+N-1},Z_i,Z_{i+N+1},\ldots,Z_{2N}),$$

for any $\theta \in \Theta$:



7-'exp< 



-p(2)\ ^ 



N 



1 

2iV2 



N 



Y,[Te{Z.O-Te{Z,+N)Y 



r/ > < Pexp(— 77) 



and the reverse inequality.

Proof. Let $\mathcal{L}hs$ denote the left-hand side of Lemma 5.4. For short, let us put:

AT N 



Then we have: 



N 



Chs = P2N expP(2) ( ^Y}Tg{Z,) - Tg{Z,+N)] - ^s{6) - 



N 



< P2nP^^^ cxp ( ^Y.^Tg{Z,) Tg{Z,+N)] ~ ^s{0) - ) , 



2N 



by Jensen's conditional inequality. Now, we can conclude as in Lemma 5.2: 
Chs = P2ivexp|^^logcosh| A[T,(Z,) - Tg{Z,+N)]j - - 77^ 



N 



^^Py-^J2lTeiZ,) - Tg{Z,+^)f - —si9) ^ 



<P2N 



□ 



Proof of Theorem 3.4. We apply both inequalities of Lemma 5.4 to every 9k,k G {1, . . . , m}, and we take: 

A = 



/2iVlog(2m/£) 



s{9) ■ 
We obtain, for any fee {!,..., m}: 

Ve^pl ——J2[TeiZ,) - Tg{Z,+N)] - log 



2m 



r]><e. 



Or, with probability at least 1 — e, for any k: 

N 

N- 



^f:[TgiZ.)-TgiZ.,.)]<^'-^^^^^ 
N 2N



z=l i=N+l 

We end the first part of the proof by noting that: 

N , 2N 



p^^ls{0) = Vi{e) + V2{e) + 



i=l 



i=N+l 

Now, let us see how we can obtain the second part of the theorem. Note that: 

2 



V2{0) 



2N / 2N 

i=N+l \ i=N+l 



We upper bound the first term by using Lemma 5.2 with $g(\theta(X_i),Y_i) = \theta(X_i)^2Y_i^2 = T_\theta(Z_i)^2$, so with probability at least $1 - \varepsilon$, for any $k$:



1 V T(-^^^^ ^ \-T(-^^ I , 21og(m/£)(l/iV)E?fir.(^0 



i=N+l 



For the second-order term, we use both inequalities of Lemma 5.2 with $g(\theta(X_i),Y_i) = \theta(X_i)Y_i = T_\theta(Z_i)$, so with probability at least $1 - \varepsilon$, for any $k$:

N \2/2W ^ 2N ^ 2N 

-Er«(^,) - U E ToizM < j^T.Te{Z.)-- E Te{Z.) ^T^TeiZ.) 

i=l / \ i=N+l / i=l i=N+l i=l 



{l/N)Y.ZiTe{Z,Y\og{2m/e) 1 



<2\ 



2N 



N 



N 



Y.\^e{Z.) 



Putting all pieces together (and replacing $\varepsilon$ by $\varepsilon/3$) ends the proof. □

5.5. Proof of Theorem 3.5

Proof of Theorem 3.5. We introduce the following conditional probability measures, for any $i \in \{1,\ldots,N\}$:

P ' 



(fc + 1)! 



E h 



and 



(■Zl ,Zjv(cr(l)-l) + i ,Zi+i,...,Zjv + i-l,2jV(o-(2)-l) + i JV + i+1 -i-ZfcN + i-l i-Z JV(o-(fc + l)- l) + i ,2feN + !+l,---,-Z(fc + i)jv) • 



N 

p = (g)p, 

1=1 
and, finally, remember that:

{k+l)N 

(k + l)N ^ 

Note that, by exchangeability, for any nonnegative function 
/i:(A'xR)('=+i)^^R 
we have, for any i S {1, . . . , N}: 

P{k+l)N^ih{Zi, . . . , Z2n) = P(k+l)Nh{Zi, . . . , Z2n)- 



Lemma 5.5. Let x be a function 
e:{X X R)('^+i)^ ^0 we have: 

{k+l)N 



For any exchangeable functions X, rj: [X X M)('^+i)^ ^M_^ and 



( r , {k+l)N N 1 \ 

y L i=N+l i=l J J 



< cxp(— 77) cxp 
A3(l + fc)3 



A2(l + fc)2 



2iVP -P{[x(e(W)-Px(e(W)] } 



sup , inf ^ ,X(^(^.)^^) 

ie{i,...,(fe+i)w} ie{i....,(fc+i)w} 



l3 



where we put $\lambda = \lambda(Z_1,\ldots,Z_{(k+1)N})$, $\theta = \theta(Z_1,\ldots,Z_{(k+1)N})$ and $\eta = \eta(Z_1,\ldots,Z_{(k+1)N})$ for short. We have the reverse inequality as well.

Before giving the proof, let us introduce the following useful notations. 

Definition 5.3. We put, for any 9 G 0, for any function x-' 

x1 = x{Y^e{x,)), 

and 

x'^xiYOiX)) 
that means that: 

(k+l)N 



We also put: 
5^(9) = 



l)iV ^ 



ik + l)N ^ 



sup Xi - inf Xi- 
ie{i....,{k+i)N} ie{i,...,{k+i)N} 



Proof of Lemma 5.5. Remark that, for any exchangeable functions $\lambda, \eta: (\mathcal{X} \times \mathbb{R})^{(k+1)N} \to \mathbb{R}_+$ and $\theta: (\mathcal{X} \times \mathbb{R})^{(k+1)N} \to \Theta$ we have:



Pexp<^ A 



(fc+l)JV AT 



kN 



i=l 



N 
N



:exp(-r?)[]exp<^ — ^ 



i=i 



TV' 
exp 



A(i + fc) 
/cA^ xz 
where we put $\lambda = \lambda(Z_1,\ldots,Z_{(k+1)N})$, $\theta = \theta(Z_1,\ldots,Z_{(k+1)N})$ and $\eta = \eta(Z_1,\ldots,Z_{(k+1)N})$ for short.
Now, we have: 



N 



log Y\ Pi exp 



A(l + fc) e 



^fogP^expj - 



A(l + fc) 



and, for any i G {!,..., A^}: 



logP.exp^-^X^ 



A(l + fc)^ , , A2(l + fc)^ 



-Pa: 



X{l+k)/(Nk) 



1 /A(l + fc) 



2 V iVfc 



P,cxp[-/3x,' 



P,cxph/3xf] 



en \ 3 



cxp(-/?x') 



d/3. 



Note that, for any (3>0: 



1 



P,exp[-/3xf] 
and so: 



P,exp[-/?x'] 



cxp(-/?x-) 



< 



sup x.+o-i)w - inf x,+(,-i)iv 



Nk \ - k "-^^ N ^ oArL2 ^nlXj ^^iXjJ 



logJ^P^exp 



TV ^ 27Vfc2 

i=l 



sup Xi - mi Xi 

l-ie{l,...,(fc+l)Af} iG{l,...,(fc+l)Ar} 



6iV2p 



Note that: 



1 



and so: 



N 



. , (fc+l)JV 

7V^ (fc+l)iV ^ ' 



remark also that: 
N 



. , (fe+l)7V 



(fc+l)7V 

(fc+i)7v E 



P[(x'-PxT], 
we obtain:

Pcxpi A 



N 



- Y: 0{X.)Y.--Y,B{X.)Y. 



kN 



i=N+l 



exp(— 7y) exp 



■P[(x'-Px')1 



A3(l + /c)3 



The proof of the reverse inequality is exactly the same.



t> ■ r t 

sup Xi - mi Xi 

ie{l,...,(fc+l)Af} iG{l,...,(fe+l)W} 



□ 



Let us choose here again $\chi$ such that $\chi(x) = x$, namely $\chi = \mathrm{id}$. By the use of a union bound argument on elements of $\Theta_0$ we obtain, for any $\varepsilon > 0$, for any exchangeable function $\lambda: (\mathcal{X} \times \mathbb{R})^{(k+1)N} \to \mathbb{R}_+^*$, with probability at least $1 - \varepsilon$, for any $h \in \{1,\ldots,m\}$:



N 



kN V TV 



< 



2iV 



6iV^ 



Let us choose, for any /i € {1, . . . , m}: 
A = 



2A^log(m/£) 



the bound becomes: 

^ {k+l)N N 

kN ^ 0,iX,)Y,^-YOH{X^)Y, 



1=1 



2N 



3NP[{x'^t' -Px'^"?] 



We use the reverse inequality in exactly the same way; we then combine both inequalities by a union bound argument and obtain the following result. For any $\varepsilon > 0$, with $P_{(k+1)N}$-probability at least $1 - \varepsilon$ we have, for any $h \in \{1,\ldots,m\}$:



2Yg, log(2m/£) 



N 



:i/(fc^))E!=1v+>'^(^o^ 

, 2(log(2m/e))3/25,,(0,.)3 {log{2m/e))^SM'' 



3^3/2vi/2 

Oh 

remember that: 

Ye^P{mx)Y)-P{0{X)Y)f}. 
We now give a new lemma. 
Lemma 5.6. Let us assume that $P$ is such that, for any $h \in \{1,\ldots,m\}$:

$$\exists \beta_h > 0,\ \exists B_h \ge 0,\quad P\exp\bigl(\beta_h|\theta_h(X)Y|\bigr) \le B_h.$$



(5.5) 






This is for example the case if $\theta_h(X_i)Y_i$ is subgaussian, with any $\beta_h > 0$ and



^ 2 

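Then we have, for any $\varepsilon \ge 0$:

$$P_{(k+1)N}\Bigl(\sup_{1 \le i \le (k+1)N} \theta_h(X_i)Y_i \le \frac{1}{\beta_h}\log\frac{(k+1)N B_h}{\varepsilon}\Bigr) \ge 1 - \varepsilon.$$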

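Proof. We have:

$$\begin{aligned}
P_{(k+1)N}\Bigl(\sup_{1 \le i \le (k+1)N} \theta_h(X_i)Y_i \ge s\Bigr)
&= P_{(k+1)N}\bigl(\exists i \in \{1,\ldots,(k+1)N\},\ \theta_h(X_i)Y_i \ge s\bigr) \\
&\le \sum_{i=1}^{(k+1)N} P_{(k+1)N}\bigl(\theta_h(X_i)Y_i \ge s\bigr) \\
&\le (k+1)N\,P\exp\bigl(\beta_h[|\theta_h(X)Y| - s]\bigr) \le (k+1)N B_h \exp(-\beta_h s).
\end{aligned}$$

Now, let us choose:

$$s = \frac{1}{\beta_h}\log\frac{(k+1)N B_h}{\varepsilon},$$

and we obtain the lemma. □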
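The choice of $s$ in this proof can be checked with a short computation: plugging the threshold into the union bound $(k+1)N B_h \exp(-\beta_h s)$ returns exactly $\varepsilon$. The function names and the numerical values below are hypothetical; `n_points` stands for $(k+1)N$.

```python
import math


def sup_threshold(beta_h, B_h, n_points, eps):
    """Threshold s = (1 / beta_h) * log(n_points * B_h / eps)."""
    return math.log(n_points * B_h / eps) / beta_h


def union_bound(beta_h, B_h, n_points, s):
    """Upper bound n_points * B_h * exp(-beta_h * s) on the tail probability."""
    return n_points * B_h * math.exp(-beta_h * s)


# Hypothetical constants: beta_h = 0.5, B_h = 3, (k+1)N = 200, eps = 0.05.
s = sup_threshold(beta_h=0.5, B_h=3.0, n_points=200, eps=0.05)
tail = union_bound(0.5, 3.0, 200, s)
```

The threshold grows only logarithmically with the number of design points, which is why this term does not affect the rates in the final bound.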

As a consequence, using a union bound argument, we have, for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, for any $h \in \{1,\ldots,m\}$:

sup eh{X,)Y,- inf gh(x,)y.< j-log ^^^ + ^^"'^^\ 

ie{l,...,{k+l)N} i€{l,....{k+l)N} Ph £ 

By plugging the lemma into Eq. (5.5) we obtain the theorem. □ 
5.6. Proof of Theorem 2.1: integration of the transductive results 

Actually, the proof is quite direct now: instead of using the techniques given in the section devoted to the inductive case, we use a result valid in the transductive case and integrate it with respect to the test sample. This idea is quite classical in learning theory, and was actually one of the reasons for the introduction of the transductive setting (see (22) for example). There are several ways to perform this integration (see for example (6)); here we choose to apply a result obtained by Panchenko (17) that gives a particularly simple result.

Lemma 5.7 ((17), Corollary 1). Let us assume that we have i.i.d. variables $T_1, \ldots, T_N$ (with distribution $P$ and values in $\mathbb{R}$) and an independent copy $T' = (T'_1, \ldots, T'_N)$ of $T = (T_1, \ldots, T_N)$. Let $\xi_j(T,T')$ for $j \in \{1,2,3\}$ be three measurable functions taking values in $\mathbb{R}$, and $\xi_3 \ge 0$. Let us assume that we know two constants $A \ge 1$ and $a > 0$ such that, for any $u > 0$:

$$P^{\otimes 2N}\Bigl[\xi_1(T,T') \ge \xi_2(T,T') + \sqrt{\xi_3(T,T')\,u}\Bigr] \le A\exp(-au).$$

Then, for any $u > 0$:

$$P^{\otimes 2N}\Bigl\{P^{\otimes 2N}[\xi_1(T,T') \mid T] \ge P^{\otimes 2N}[\xi_2(T,T') \mid T] + \sqrt{P^{\otimes 2N}[\xi_3(T,T') \mid T]\,u}\Bigr\} \le A\exp(1 - au).$$






Proof of Theorem 2.1. A simple application of the first inequality of Lemma 5.2 (given as a tool for the 
proof of the transductive results) with e > 0, any fc e {1, . . . , m}, g ^id, 1] = 1 + log — and: 



Nrj 



leads us to the following bound, for any k: 



P«52Wexp 



<exp(-77), 



p®2JV 



1 ^ 



\ 



At] 



2N 



< exp(-7;) 



2fcexp(l)' 



We now apply Panchenko's lemma with: 

Ti = 9k{Xi)Yi, Tl ^ 6k{Xi^N)Yi-i-N, 



N N 



i=l 
2N 



^,iT,r) = — Y,Ok{x,)X'>o, 



i=l 



and $A = a = 1$. We obtain

N 

N 



p02N 



1 ^ 

-J2MX,)Y,-P[dkiX)Y]]> 



i=l 



\ 



N 



<exp(l = 



Remark finally that:

$$P[\theta_k(X)^2Y^2] \le P[\theta_k(X)^2](B^2 + \sigma^2).$$

We proceed in exactly the same way with the reverse inequalities for any $k$ and combine the obtained $2m$ inequalities to obtain the result:



f 1 

P^^'l 3fc e {1, . . . ,m}, - ^ \9k{X,)Y - P[9k{X)Y] 

[ i=i 



> 



\ 



( 1 

P«2^ 3fc e {1, . . . , m}, - J2 mX,)Y, PmX)Y]\ 

[ i=l 



> 



\ 



i=l J 



that ends the proof. 



□ 
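5.7. Proof of Theorems 4.1 and 4.2: Theorem 2.3 used as an oracle inequality

Proof of Theorem 4.1. Let us begin the proof with a general $m$ and $\varepsilon$; the reason for the choice $m = N$ and $\varepsilon = N^{-2}$ will become clear. Let us also call $\mathcal{E}(\varepsilon)$ the event satisfied with probability at least $1 - \varepsilon$ in Theorem 2.1. We have:

$$P^{\otimes N}\bigl[\|\Pi_P^{\mathcal{F},m}\hat\theta - f\|_P^2\bigr] = P^{\otimes N}\bigl[\mathbb{1}_{\mathcal{E}(\varepsilon)}\|\Pi_P^{\mathcal{F},m}\hat\theta - f\|_P^2\bigr] + P^{\otimes N}\bigl[(1 - \mathbb{1}_{\mathcal{E}(\varepsilon)})\|\Pi_P^{\mathcal{F},m}\hat\theta - f\|_P^2\bigr].$$

First of all, it is obvious that:

$$P^{\otimes N}\bigl[(1 - \mathbb{1}_{\mathcal{E}(\varepsilon)})\|\Pi_P^{\mathcal{F},m}\hat\theta - f\|_P^2\bigr] \le 2P^{\otimes N}\bigl[(1 - \mathbb{1}_{\mathcal{E}(\varepsilon)})(\|\Pi_P^{\mathcal{F},m}\hat\theta\|_P^2 + \|f\|_P^2)\bigr] \le 2\varepsilon(B^2m + B^2) = 2\varepsilon(m+1)B^2.$$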
For the other term, just remark that, for any m' < m: 



n^'^'O - /lip = ||7T;''"7T™- . . . il^-'O - /lip < 1177]?-' • • • 77^^^0 - /Hp < ||77^' ' ■ ■ ■ H'/O - /||p 



< 



E 

fc=l 



4[1 + log(2TO/£)] 

N 



N 



\0m' - /Hp- 



This is where Theorem 2.3 has been used as an oracle inequality: the estimator that we have, with m>m', 
is better than the one with the "good choice" m' . We also have: 



P'"'[leie)\\n^'"'0-f\\p]<P' 



E 

.fc=i 



4[l + log(2m/e)] 
N 



1 ^ 



4=1 



,8[l+log(2m/£)] 2 2 
< m — [B + a \ 



N 



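So finally, we obtain, for any $m' \le m$:

$$P^{\otimes N}\bigl[\|\Pi_P^{\mathcal{F},m}\hat\theta - f\|_P^2\bigr] \le m'\,\frac{8[1 + \log(2m/\varepsilon)]}{N}\,[B^2 + \sigma^2] + C(m')^{-2\beta} + 2\varepsilon(m+1)B^2.$$

The choice of:

$$m' = \Bigl(\frac{N}{\log N}\Bigr)^{1/(2\beta+1)}$$

leads to a first term of order $N^{-2\beta/(2\beta+1)}\log(2m/\varepsilon)\,(\log N)^{-1/(2\beta+1)}$ and a second term of order $N^{-2\beta/(2\beta+1)}(\log N)^{2\beta/(2\beta+1)}$. The choice of $m = N$ and $\varepsilon = N^{-2}$ gives a first and a second term of the desired order $N^{-2\beta/(2\beta+1)}(\log N)^{2\beta/(2\beta+1)}$ while keeping the third term at order $N^{-1}$. This proves the theorem. □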
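The trade-off at the end of this proof can be illustrated numerically. The sketch below evaluates a three-term bound of the form "variance + bias + rare event" with $m = N$, $\varepsilon = N^{-2}$ and $m' \approx (N/\log N)^{1/(2\beta+1)}$, and compares it with the target rate $(\log N / N)^{2\beta/(2\beta+1)}$. The constants ($B = \sigma = C = 1$, $\beta = 1$), the rounding of $m'$ to an integer and the function names are assumptions made for this illustration.

```python
import math


def risk_bound(N, beta, B, sigma, C, m_prime):
    """Three-term bound with m = N and eps = N^-2 (assumed constants)."""
    m, eps = N, N ** -2.0
    variance = (m_prime * 8.0 * (1.0 + math.log(2.0 * m / eps)) / N
                * (B ** 2 + sigma ** 2))
    bias = C * m_prime ** (-2.0 * beta)
    rare_event = 2.0 * eps * (m + 1) * B ** 2
    return variance + bias + rare_event


def target_rate(N, beta):
    """Rate (log N / N)^(2 beta / (2 beta + 1))."""
    return (math.log(N) / N) ** (2.0 * beta / (2.0 * beta + 1.0))


beta = 1.0
ratios = []
for N in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    # m' of order (N / log N)^(1 / (2 beta + 1)), rounded to an integer.
    m_prime = max(1, round((N / math.log(N)) ** (1.0 / (2.0 * beta + 1.0))))
    ratios.append(risk_bound(N, beta, 1.0, 1.0, 1.0, m_prime)
                  / target_rate(N, beta))
```

The ratio of the bound to the target rate stays roughly constant as $N$ grows, which is what "of the desired order" means here; the size of that constant depends on $B$, $\sigma$ and $C$.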

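Proof of Theorem 4.2. Here again let us write $\mathcal{E}(\varepsilon)$ for the event satisfied with probability at least $1 - \varepsilon$ in Theorem 2.1. We have:

$$P^{\otimes N}\bigl[\|\Pi_P^{\mathcal{F},m}\hat\theta - f\|_P^2\bigr] = P^{\otimes N}\bigl[\mathbb{1}_{\mathcal{E}(\varepsilon)}\|\Pi_P^{\mathcal{F},m}\hat\theta - f\|_P^2\bigr] + P^{\otimes N}\bigl[(1 - \mathbb{1}_{\mathcal{E}(\varepsilon)})\|\Pi_P^{\mathcal{F},m}\hat\theta - f\|_P^2\bigr].$$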

For the first term we still have: 



|77^'™0-/||^<2(TO+1)P2 
For the second term, let us write the expansion of $f$ into our wavelet basis:

/ = at/) + ^ ^ Pj.kipj.k, 
j=o fc=l 

and 

J 2^ 

e{x) = + X! X! ^J^fcV'j,fc 

j=0 fc=l 

the estimator 9. Let us put J = 2L(i°g^)/i°g2j ^ 

J 2^ oo 2^ 

J 2-^ J 2^ oo 2^ 

j=Ok=l J = Ofe = l j = J+lA;=l 

for any $\kappa \ge 0$, as soon as $\mathcal{E}(\varepsilon)$ is satisfied (here again we used Theorem 2.3 as an oracle inequality). Now, we follow the technique used in (11) and (12) (see also the end of the third chapter in (7)). As soon as $\mathcal{E}(\varepsilon)$ is satisfied we have:



J 2^ 

EE(/5.'fc-/3,,fc)'i(i/?,,fci>A^)< 

j=0 fc=l 



< 



In the same way, we have: 

J 2' 



8(^2- 


f (T2)log(2m/e) 




N 


8(^2- 


f cr2)iog(2m/e) 




N 


8(^2- 


^(j'^)\og{2m/e) 




N 


J 


2^ 



EEl(l/3,,fel>«^) 

J 2^/1 1^ 2/(2s+l) 
\PjM 

j=Ok=l 

J 2J 

^-2/(2s+l)^^|^^.^|2/(2.+l)^ 
j=0 fe=l 



,2-2/(l+2s) 

j=Ok=l j=Ok=l 

So we have to give an upper bound on the quantity: 

EEi/3..p/<^^+^^. 

i=o fe=l 

By Holder's inequality we have, as soon as p > 2rfi '' 



EEi/3..p/<^^-^^^<E 

i=0 k=l j=0 



2^ 



2j(l + l/2-l/p)^|^^.^|p 



fc=l 



2/(l+2s 



< ||/||2/(l^+2s)j(l-2/((l+2s)9))+^ 
let us put $C'' = \|f\|_{s,p,q}^{2/(1+2s)}$. Finally, note that we have, for $p \ge 2$:

j=J+lk=l j=J+l\k=l I 

As $f \in B_{s,p,q} \subset B_{s,p,\infty}$ we have:

E'^i,fc <c'2-2j(-+i/2-i/p) 

for some C" and so: 
00 1' 

E E/^l.fc^^"'2~''^ 
j=j+i fe=i 

for some C". In the case where p < 2 we use (see (12). for s > ^ — i): 
to obtain: 

E E ^Ik ^ C""2-2.^(«+i/2-i/p) < c""2-'. 
j=j+i fc=i 

So we have: 

pr^N^2^f^ /) < 2(m + l)e(i?2 + + 8(-B'+^'^l0g(2m/£) (^ ^ ^/^-2/(l+2s) j(l-2/((l+2s)g))+^ 

III /n — ,J\2s ^Illln — J 



Let us remember that:

$$\frac{N}{2} \le m = 2^J \le N$$

and that $\varepsilon = N^{-2}$, and take:

$$\kappa = \sqrt{\frac{\log N}{N}}$$

to obtain the desired rate of convergence. □